Abstract

We address the problem of learning in an on- 
line setting where the learner repeatedly ob- 
serves features, selects among a set of actions, 
and receives reward for the action taken. We 
provide the first efficient algorithm with an 
optimal regret. Our algorithm uses a cost-sensitive
classification learner as an oracle
and has a running time polylog(N), where N
is the number of classification rules among 
which the oracle might choose. This is expo- 
nentially faster than all previous algorithms 
that achieve optimal regret in this setting. 
Our formulation also enables us to create an 
algorithm with regret that is additive rather 
than multiplicative in feedback delay as in all 
previous work. 



1 INTRODUCTION 

The contextual bandit setting consists of the following 
loop repeated indefinitely: 

1. The world presents context information as fea- 
tures x. 

2. The learning algorithm chooses an action a from 
K possible actions. 

3. The world presents a reward r for the action. 

The key difference between the contextual bandit set- 
ting and standard supervised learning is that only the 
reward of the chosen action is revealed. For example, 
after always choosing the same action several times 
in a row, the feedback given provides almost no ba- 
sis to prefer the chosen action over another action. 
In essence, the contextual bandit setting captures the 
difficulty of exploration while avoiding the difficulty 



of credit assignment as in more general reinforcement 
learning settings. 

The contextual bandit setting is a half-way point be- 
tween standard supervised learning and full-scale re- 
inforcement learning where it appears possible to con- 
struct algorithms with convergence rate guarantees 
similar to supervised learning. Many natural settings 
satisfy this half-way point, motivating the investiga- 
tion of contextual bandit learning. For example, the 
problem of choosing interesting news articles or ads for 
users by internet companies can be naturally modeled 
as a contextual bandit setting. In the medical domain 
where discrete treatments are tested before approval, 
the process of deciding which patients are eligible for 
a treatment takes contexts into account. More gener- 
ally, we can imagine that in a future with personalized 
medicine, new treatments are essentially equivalent to 
new actions in a contextual bandit setting. 

In the i.i.d. setting, the world draws a pair (x, r) consisting
of a context and a reward vector from some
unknown distribution D, revealing x in Step 1, but
only the reward r(a) of the chosen action a in Step 3.
Given a set of policies $\Pi = \{\pi : X \to A\}$, the goal
is to create an algorithm for Step 2 which competes 
with the set of policies. We measure our success by 
comparing the algorithm's cumulative reward to the 
expected cumulative reward of the best policy in the 
set. The difference of the two is called regret. 

All existing algorithms for this setting either achieve
a suboptimal regret (Langford and Zhang, 2007) or
require computation linear in the number of policies
(Auer et al., 2002b; Beygelzimer et al., 2011). In
unstructured policy spaces, this computational complexity
is the best one can hope for. On the other
hand, in the case where the rewards of all actions are 
revealed, the problem is equivalent to cost-sensitive 
classification, and we know of algorithms to efficiently 
search the space of policies (classification rules) such 
as cost-sensitive logistic regression and support vector
machines. In these cases, the space of classification
rules is exponential in the number of features, but
these problems can be efficiently solved using convex 
optimization. 

Our goal here is to efficiently solve the contextual 
bandit problems for similarly large policy spaces. 
We do this by reducing the contextual bandit problem
to cost-sensitive classification. Given a supervised
cost-sensitive learning algorithm as an oracle
(Beygelzimer et al., 2009), our algorithm runs
in time only polylog(N) while achieving regret
$O(\sqrt{TK \ln N})$, where N is the number of possible policies
(classification rules), K is the number of actions
(classes), and T is the number of time steps. This efficiency
is achieved in a modular way, so any future improvement
in cost-sensitive learning immediately applies here.

1.1 PREVIOUS WORK AND 
MOTIVATION 

All previous regret-optimal approaches are measure 
based — they work by updating a measure over poli- 
cies, an operation which is linear in the number of 
policies. In contrast, regret guarantees scale only log- 
arithmically in the number of policies. If not for the 
computational bottleneck, these regret guarantees im- 
ply that we could dramatically increase performance in 
contextual bandit settings using more expressive poli- 
cies. We overcome the computational bottleneck using 
an algorithm which works by creating cost-sensitive 
classification instances and calling an oracle to choose 
optimal policies. Actions are chosen based on the 
policies returned by the oracle rather than according
to a measure over all policies. This is reminiscent
of AdaBoost (Freund and Schapire, 1997), which creates
weighted binary classification instances and calls
a "weak learner" oracle to obtain classification rules.
These classification rules are then combined into a final
classifier with boosted accuracy. Just as AdaBoost
converts a weak learner into a strong learner,
our approach converts a cost-sensitive classification 
learner into an algorithm that solves the contextual 
bandit problem. 

In a more difficult version of contextual bandits, an ad- 
versary chooses (x,r) given knowledge of the learning 
algorithm (but not any random numbers). All known 
regret-optimal solutions in the adversarial setting are
variants of the EXP4 algorithm (Auer et al., 2002b).
EXP4 achieves the same regret rate as our algorithm:
$O(\sqrt{KT \ln N})$, where T is the number of time steps,
K is the number of actions available in each time step,
and N is the number of policies.



Why not use EXP4 in the i.i.d. setting? For example,
it is known that the algorithm can be modified
to succeed with high probability (Beygelzimer et al.,
2011), and also for VC classes when the adversary is
constrained to i.i.d. sampling. There are two central
benefits that we hope to realize by directly assuming
i.i.d. contexts and reward vectors.

1. Computational Tractability. Even when the reward
vector is fully known, adversarial regrets
scale as $O(\sqrt{T \ln N})$ while computation scales
as $O(N)$ in general. One attempt to get
around this is the follow-the-perturbed-leader
algorithm (Kalai and Vempala, 2005), which provides
a computationally tractable solution in certain
special-case structures. This algorithm has
no mechanism for efficient application to arbitrary
policy spaces, even given an efficient cost-sensitive
classification oracle. An efficient cost-sensitive
classification oracle has been shown effective in
transductive settings (Kakade and Kalai, 2005).
Aside from the drawback of requiring a transductive
setting, the regret achieved there is substantially
worse than for EXP4.

2. Improved Rates. When the world is not completely
adversarial, it is possible to achieve substantially
lower regrets than are possible with algorithms
optimized for the adversarial setting.
For example, in supervised learning, it is possible
to obtain regrets scaling as $O(\log(T))$ with a
problem-dependent constant (Bartlett et al., 2007).
When the feedback is delayed by $\tau$ rounds, lower
bounds imply that the regret in the adversarial
setting increases by a multiplicative $\sqrt{\tau}$, while in
the i.i.d. setting it is possible to achieve an additive
regret of $\tau$ (Langford et al., 2009).

In a direct i.i.d. setting, the previous-best approach
using a cost-sensitive classification oracle
was given by $\epsilon$-greedy and epoch-greedy algorithms
(Langford and Zhang, 2007), which have a
regret scaling as $O(T^{2/3})$ in the worst case.

There have also been many special-case analyses. For
example, the theory of the context-free setting is well
understood (Lai and Robbins, 1985; Auer et al., 2002a;
Even-Dar et al., 2006). Similarly, good algorithms exist
when rewards are linear functions of features (Auer,
2002) or actions lie in a continuous space with the
reward function sampled according to a Gaussian process
(Srinivas et al., 2010).

1.2 WHAT WE PROVE



In Section 3 we state the PolicyElimination algorithm,
and prove the following regret bound for it.

Theorem 4. For all distributions D over $(x, \vec{r})$ with
K actions, for all sets of N policies $\Pi$, with probability
at least $1 - \delta$, the regret of PolicyElimination
(Algorithm 1) over T rounds is at most

$$16 \sqrt{2TK \ln \frac{4T^2 N}{\delta}} .$$

This result can be extended to deal with VC classes, 
as well as other special cases. It forms the simplest 
method we have of exhibiting the new analysis. 
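To get a feel for how the bound of Theorem 4 scales, the following minimal sketch evaluates it numerically. The function name and the parameter values are illustrative assumptions, not part of the paper; the formula is the one stated in the theorem.

```python
import math

def policy_elimination_bound(T, K, N, delta):
    """Theorem 4 bound: 16 * sqrt(2 T K ln(4 T^2 N / delta))."""
    return 16.0 * math.sqrt(2.0 * T * K * math.log(4.0 * T**2 * N / delta))

# Hypothetical problem sizes: growing N from 10^3 to 10^6 policies
# only changes the term inside the logarithm.
small = policy_elimination_bound(T=10**6, K=10, N=10**3, delta=0.05)
large = policy_elimination_bound(T=10**6, K=10, N=10**6, delta=0.05)
print(small, large, large / small)
```

The ratio stays close to 1, which illustrates the logarithmic dependence on the number of policies that motivates working with large policy classes.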

The new key element of this algorithm is identifica- 
tion of a distribution over actions which simultane- 
ously achieves small expected regret and allows esti- 
mating value of every policy with small variance. The 
existence of such a distribution is shown nonconstruc- 
tively by a minimax argument. 

PolicyElimination is computationally intractable
and also requires exact knowledge of the context
distribution (but not the reward distribution!). We show
how to address these issues in Section 4 using an algorithm
we call RandomizedUCB. Namely, we prove
the following theorem.

Theorem 5. For all distributions D over $(x, \vec{r})$ with
K actions, for all sets of N policies $\Pi$, with probability
at least $1 - \delta$, the regret of RandomizedUCB
(Algorithm 2) over T rounds is at most

$$O\left( \sqrt{TK \log(TN/\delta)} + K \log(NK/\delta) \right) .$$

RandomizedUCB's analysis is substantially more
complex, with a key subroutine being an application
of the ellipsoid algorithm with a cost-sensitive
classification oracle (described in Section 5).
RandomizedUCB does not assume knowledge of the
context distribution, and instead works with the history
of contexts it has observed. Modifying the
proof for this empirical distribution requires a covering
argument over the distributions over policies
which uses the probabilistic method. The net result
is an algorithm with a similar top-level analysis as
PolicyElimination, but with the running time only
polylogarithmic in the number of policies given a
cost-sensitive classification oracle.

Theorem 11. In each time step t, RandomizedUCB
makes at most $O(\mathrm{poly}(t, K, \log(1/\delta), \log N))$ calls to the
cost-sensitive classification oracle, and requires
additional $O(\mathrm{poly}(t, K, \log N))$ processing time.

Apart from a tractable algorithm, our analysis can be
used to derive tighter regrets than would be possible in
the adversarial setting. For example, in Section 6 we consider
a common setting where reward feedback is delayed
by $\tau$ rounds. A straightforward modification of
PolicyElimination yields a regret with an additive
term proportional to $\tau$ compared with the delay-free
setting. Namely, we prove the following.

Theorem 12. For all distributions D over $(x, \vec{r})$ with
K actions, for all sets of N policies $\Pi$, and all delay
intervals $\tau$, with probability at least $1 - \delta$, the regret
of DelayedPE (Algorithm 3) is at most

$$16 \sqrt{2K \ln \frac{4T^2 N}{\delta}} \left( \tau + \sqrt{T} \right) .$$

We start next with precise settings and definitions. 

2 SETTING AND DEFINITIONS 
2.1 THE SETTING 

Let A be the set of K actions, let X be the domain of
contexts x, and let D be an arbitrary joint distribution
on $(x, \vec{r})$. We denote the marginal distribution of D
over X by $D_X$.

We denote by $\Pi$ a finite set of policies $\{\pi : X \to A\}$,
where each policy $\pi$, given a context $x_t$ in round t,
chooses the action $\pi(x_t)$. The cardinality of $\Pi$ is denoted
by N. Let $\vec{r}_t \in [0, 1]^K$ be the vector of rewards,
where $r_t(a)$ is the reward of action a on round t.

In the i.i.d. setting, on each round t = 1 ... T, the
world chooses $(x_t, \vec{r}_t)$ i.i.d. according to D and reveals
$x_t$ to the learner. The learner, having access to $\Pi$,
chooses action $a_t \in \{1, \ldots, K\}$. Then the world reveals
reward $r_t(a_t)$ (which we call $r_t$ for short) to the learner,
and the interaction proceeds to the next round.

We consider two modes of accessing the set of policies
$\Pi$. The first option is through the enumeration of all
policies. This is impractical in general, but suffices 
for the illustrative purpose of our first algorithm. The 
second option is an oracle access, through an argmax 
oracle, corresponding to a cost-sensitive learner: 

Definition 1. For a set of policies $\Pi$, an argmax oracle
(AMO for short) is an algorithm which, for any
sequence $\{(x_{t'}, \vec{r}_{t'})\}_{t'=1 \ldots t}$, $x_{t'} \in X$, $\vec{r}_{t'} \in \mathbb{R}^K$, computes

$$\arg\max_{\pi \in \Pi} \sum_{t'=1 \ldots t} r_{t'}(\pi(x_{t'})) .$$

The reason why the above can be viewed as a cost-sensitive
classification oracle is that vectors of rewards
$\vec{r}_{t'}$ can be interpreted as negative costs and hence the
policy returned by AMO is the optimal cost-sensitive
classifier on the given data.
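As an illustration of Definition 1, here is a minimal brute-force sketch of an argmax oracle that enumerates the policies (the first, impractical access mode). The names `argmax_oracle`, `match_x`, and the toy data are assumptions made for this example; a practical oracle would be a cost-sensitive learner rather than an enumeration.

```python
from typing import Callable, Hashable, Sequence, Tuple

Policy = Callable[[Hashable], int]  # maps a context to an action index

def argmax_oracle(policies: Sequence[Policy],
                  data: Sequence[Tuple[Hashable, Sequence[float]]]) -> Policy:
    """Return the policy maximizing sum_t r_t(pi(x_t)) by enumeration."""
    def total_reward(pi: Policy) -> float:
        return sum(rewards[pi(x)] for x, rewards in data)
    return max(policies, key=total_reward)

# Toy instance: contexts are 0 or 1, two actions, three policies.
always_0 = lambda x: 0
always_1 = lambda x: 1
match_x = lambda x: x  # plays action 0 on context 0, action 1 on context 1
data = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (1, [0.2, 0.9])]
best = argmax_oracle([always_0, always_1, match_x], data)
```

On this data the totals are 1.2, 1.9, and 2.9 respectively, so the oracle returns `match_x`.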



2.2 EXPECTED AND EMPIRICAL 
REWARDS 



Let the expected instantaneous reward of a policy $\pi \in \Pi$
be denoted by

$$\eta_D(\pi) = \mathbb{E}_{(x, \vec{r}) \sim D} [r(\pi(x))] .$$

The best policy $\pi_{\max} \in \Pi$ is that which maximizes
$\eta_D(\pi)$. More formally,

$$\pi_{\max} = \arg\max_{\pi \in \Pi} \eta_D(\pi) .$$

We define $h_t$ to be the history at time t that the learner
has seen. Specifically,

$$h_t = \bigcup_{t'=1 \ldots t} (x_{t'}, a_{t'}, r_{t'}, p_{t'}) ,$$

where $p_{t'}$ is the probability of the algorithm choosing
action $a_{t'}$ at time t'. Note that $a_{t'}$ and $p_{t'}$ are produced
by the learner while $x_{t'}$, $r_{t'}$ are produced by nature.
We write $x \sim h$ to denote choosing x uniformly at
random from the x's in history h.

Using the history of past actions and probabilities with
which they were taken, we can form an unbiased estimate
of the policy value for any $\pi \in \Pi$:

$$\eta_t(\pi) = \frac{1}{t} \sum_{(x, a, r, p) \in h_t} \frac{r \, \mathbb{1}(\pi(x) = a)}{p} .$$

The unbiasedness follows because
$\mathbb{E}_{a \sim p} \left[ \frac{r(a) \mathbb{1}(\pi(x) = a)}{p(a)} \right] = \sum_a p(a) \frac{r(a) \mathbb{1}(\pi(x) = a)}{p(a)} = r(\pi(x))$.
The empirically best policy at time t is denoted

$$\pi_t = \arg\max_{\pi \in \Pi} \eta_t(\pi) .$$

2.3 REGRET

The goal of this work is to obtain a learner that has
small regret relative to the expected performance of
$\pi_{\max}$ over T rounds, which is

$$\sum_{t=1 \ldots T} (\eta_D(\pi_{\max}) - r_t) . \qquad (2.1)$$

We say that the regret of the learner over T rounds is
bounded by $\epsilon$ with probability at least $1 - \delta$, if

$$\Pr\left[ \sum_{t=1 \ldots T} (\eta_D(\pi_{\max}) - r_t) \le \epsilon \right] \ge 1 - \delta ,$$

where the probability is taken with respect to the random
pairs $(x_t, \vec{r}_t) \sim D$ for t = 1 ... T, as well as any
internal randomness used by the learner.

We can also define notions of regret and empirical regret
for policies $\pi$. For all $\pi \in \Pi$, let

$$\Delta_D(\pi) = \eta_D(\pi_{\max}) - \eta_D(\pi) ,$$
$$\Delta_t(\pi) = \eta_t(\pi_t) - \eta_t(\pi) .$$
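The importance-weighted value estimate and the empirical regret of a policy can be sketched as follows. This is a minimal illustration under assumed conditions: a hypothetical two-context, two-action problem with uniform exploration (so every logged probability is 1/2); none of the names or reward values come from the paper.

```python
import random

def ips_value(policy, history):
    """eta_t(pi) = (1/t) * sum over (x, a, r, p) of r * 1[pi(x) == a] / p."""
    return sum(r / p for x, a, r, p in history if policy(x) == a) / len(history)

def simulate(n_rounds, seed=0):
    """Log (x, a, r, p) tuples under uniform exploration on a toy problem."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_rounds):
        x = rng.randint(0, 1)                       # context
        rewards = [0.9, 0.1] if x == 0 else [0.2, 0.7]
        a = rng.randint(0, 1)                       # action, drawn with p = 1/2
        history.append((x, a, rewards[a], 0.5))
    return history

history = simulate(20000)
match_x = lambda x: x      # true value 0.5 * 0.9 + 0.5 * 0.7 = 0.80
always_0 = lambda x: 0     # true value 0.5 * 0.9 + 0.5 * 0.2 = 0.55
estimate_match = ips_value(match_x, history)
estimate_zero = ips_value(always_0, history)
# Empirical regret of always_0 relative to the empirically best of the two.
empirical_regret_zero = max(estimate_match, estimate_zero) - estimate_zero
```

Only the reward of the logged action is ever used, yet both estimates concentrate around the true policy values, which is the property the elimination and optimization steps rely on.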

Our algorithms work by choosing distributions over 
policies, which in turn then induce distributions over 
actions. For any distribution P over policies $\Pi$, let
$W_P(x, a)$ denote the induced conditional distribution
over actions a given the context x:

$$W_P(x, a) = \sum_{\pi \in \Pi : \pi(x) = a} P(\pi) . \qquad (2.2)$$

In general, we shall use W, W' and Z as conditional
probability distributions over the actions A given contexts
X, i.e., $W : X \times A \to [0, 1]$ such that $W(x, \cdot)$ is a
probability distribution over A (and similarly for W'
and Z). We shall think of W' as a smoothed version
of W with a minimum action probability of $\mu$ (to be
defined by the algorithm), such that

$$W'(x, a) = (1 - K\mu) W(x, a) + \mu .$$

Conditional distributions such as W (and W', Z, etc.)
correspond to randomized policies. We define notions of
true and empirical value and regret for them as follows:

$$\eta_D(W) = \mathbb{E}_{(x, \vec{r}) \sim D} [\vec{r} \cdot W(x)]$$
$$\eta_t(W) = \frac{1}{t} \sum_{(x, a, r, p) \in h_t} \frac{r \, W(x, a)}{p}$$
$$\Delta_D(W) = \eta_D(\pi_{\max}) - \eta_D(W)$$
$$\Delta_t(W) = \eta_t(\pi_t) - \eta_t(W) .$$
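The induced action distribution of Eq. (2.2) and its smoothed version can be computed directly when the policies are enumerated. The sketch below is illustrative only: the three policies, the weights `P`, and the value of `mu` are hypothetical.

```python
def induced_distribution(P, policies, x, K):
    """W_P(x, a): total weight P puts on policies choosing a in context x."""
    w = [0.0] * K
    for weight, pi in zip(P, policies):
        w[pi(x)] += weight
    return w

def smooth(w, mu):
    """W'(x, a) = (1 - K * mu) * W(x, a) + mu, so each action has mass >= mu."""
    K = len(w)
    return [(1.0 - K * mu) * wa + mu for wa in w]

policies = [lambda x: 0, lambda x: 1, lambda x: x % 3]
P = [0.5, 0.5, 0.0]                                  # weights over the policies
W = induced_distribution(P, policies, x=2, K=3)      # [0.5, 0.5, 0.0]
W_smooth = smooth(W, mu=0.1)                         # approx. [0.45, 0.45, 0.1]
```

Smoothing keeps the result a probability distribution while guaranteeing that the importance weights 1/W'(x, a) never exceed 1/mu.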

3 POLICY ELIMINATION 

The basic ideas behind our approach are demonstrated
in our first algorithm: PolicyElimination (Algorithm 1).

The key step is Step 1, which finds a distribution over
policies which induces low variance in the estimate of
the value of all policies. Below we use the minimax theorem
to show that such a distribution always exists.
How to find this distribution is not specified here, but
in Section 5 we develop a method based on the ellipsoid
algorithm. Step 2 then projects this distribution
onto a distribution over actions and applies smoothing.
Finally, Step 5 eliminates the policies that have been
determined to be suboptimal (with high probability).

Algorithm 1 PolicyElimination($\Pi$, $\delta$, K, $D_X$)

Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$
Define: $\delta_t = \delta / 4Nt^2$
Define: $b_t = 2 \sqrt{\frac{2K \ln(1/\delta_t)}{t}}$
Define: $\mu_t = \min\left\{ \frac{1}{2K}, \sqrt{\frac{\ln(1/\delta_t)}{2Kt}} \right\}$

For each timestep t = 1 ... T, observe $x_t$ and do:

1. Choose distribution $P_t$ over $\Pi_{t-1}$ s.t. $\forall \pi \in \Pi_{t-1}$:

   $$\mathbb{E}_{x \sim D_X} \left[ \frac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t} \right] \le 2K$$

2. Let $W'_t(a) = (1 - K\mu_t) W_{P_t}(x_t, a) + \mu_t$ for all $a \in A$

3. Choose $a_t \sim W'_t$

4. Observe reward $r_t$

5. Let $\Pi_t = \left\{ \pi \in \Pi_{t-1} : \eta_t(\pi) \ge \left( \max_{\pi' \in \Pi_{t-1}} \eta_t(\pi') \right) - 2b_t \right\}$

6. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$

ALGORITHM ANALYSIS

We analyze PolicyElimination in several steps.
First, we prove the existence of $P_t$ in Step 1 provided
that $\Pi_{t-1}$ is non-empty. We recast the feasibility problem
in Step 1 as a game between two players: Prover,
who is trying to produce $P_t$, and Falsifier, who is trying
to find $\pi$ violating the constraints. We give more
power to Falsifier and allow him to choose a distribution
over $\pi$ (i.e., a randomized policy) which would
violate the constraints.
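The three schedule quantities used by Algorithm 1 are simple closed forms, and a short sketch makes their behavior concrete. The function name and the values of K, N, and delta below are assumptions for illustration; the formulas are those given in the algorithm's "Define" lines.

```python
import math

def schedules(t, K, N, delta):
    """Return (delta_t, b_t, mu_t) as defined in Algorithm 1."""
    delta_t = delta / (4.0 * N * t**2)
    log_term = math.log(1.0 / delta_t)
    b_t = 2.0 * math.sqrt(2.0 * K * log_term / t)
    mu_t = min(1.0 / (2.0 * K), math.sqrt(log_term / (2.0 * K * t)))
    return delta_t, b_t, mu_t

K, N, delta = 5, 1000, 0.05
rows = [schedules(t, K, N, delta) for t in (1, 10, 100, 1000, 10000)]

# Failure probability budget: N policies, two-sided deviation, all rounds.
union_mass = sum(2.0 * N * schedules(t, K, N, delta)[0]
                 for t in range(1, 100001))
```

With these definitions the elimination threshold satisfies b_t >= 4 K mu_t in every round, and the total failure probability summed over policies and rounds stays below delta because the series of 1/t^2 converges.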

Note that any policy $\pi$ corresponds to a point in
the space of randomized policies (viewed as functions
$X \times A \to [0, 1]$), with $\pi(x, a) = \mathbb{1}(\pi(x) = a)$. For
any distribution P over policies in $\Pi_{t-1}$, the induced
randomized policy $W_P$ then corresponds to a point in
the convex hull of $\Pi_{t-1}$. Denoting the convex hull of
$\Pi_{t-1}$ by C, Prover's choice by W and Falsifier's choice
by Z, the feasibility of Step 1 follows by the following
lemma:

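Lemma 1. Let C be a compact and convex set of randomized
policies. Let $\mu \in (0, 1/K]$ and for any $W \in C$,
$W'(x, a) = (1 - K\mu) W(x, a) + \mu$. Then for all distributions D,

$$\min_{W \in C} \max_{Z \in C} \mathbb{E}_{x \sim D_X} \mathbb{E}_{a \sim Z(x, \cdot)} \left[ \frac{1}{W'(x, a)} \right] \le \frac{K}{1 - K\mu} .$$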

Proof. Let $f(W, Z) = \mathbb{E}_{x \sim D_X} \mathbb{E}_{a \sim Z(x, \cdot)} [1/W'(x, a)]$
denote the inner expression of the minimax problem.
Note that f(W, Z) is:

• everywhere defined: Since $W'(x, a) \ge \mu$, we obtain
that $1/W'(x, a) \in [0, 1/\mu]$, hence the expectations
are defined for all W and Z.

• linear in Z: Linearity follows from rewriting
f(W, Z) as

$$f(W, Z) = \mathbb{E}_{x \sim D_X} \left[ \sum_{a \in A} \frac{Z(x, a)}{W'(x, a)} \right] .$$

• convex in W: Note that $1/W'(x, a)$ is convex in
$W(x, a)$ by convexity of $1/(c_1 w + c_2)$ in $w \ge 0$, for
$c_1 \ge 0$, $c_2 > 0$. Convexity of f(W, Z) in W then
follows by taking expectations over x and a.

Hence, by Theorem 14 (in Appendix B), min and max
can be reversed without affecting the value:

$$\min_{W \in C} \max_{Z \in C} f(W, Z) = \max_{Z \in C} \min_{W \in C} f(W, Z) .$$

The right-hand side can be further upper-bounded by
$\max_{Z \in C} f(Z, Z)$, which is upper-bounded by

$$f(Z, Z) = \mathbb{E}_{x \sim D_X} \left[ \sum_{a \in A} \frac{Z(x, a)}{Z'(x, a)} \right] \le \mathbb{E}_{x \sim D_X} \left[ \sum_{a \in A : Z(x, a) > 0} \frac{Z(x, a)}{(1 - K\mu) Z(x, a)} \right] \le \frac{K}{1 - K\mu} .$$

□
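The last step of the proof, bounding f(Z, Z) by K/(1 - K mu), can be checked numerically for a single context. The sketch below samples random action distributions Z and evaluates the sum of Z(a)/Z'(a); the sampling scheme and the values of K and mu are assumptions chosen for illustration.

```python
import random

def self_play_value(Z, mu):
    """sum_a Z(a) / Z'(a), with Z'(a) = (1 - K * mu) * Z(a) + mu."""
    K = len(Z)
    return sum(z / ((1.0 - K * mu) * z + mu) for z in Z)

rng = random.Random(0)
K, mu = 4, 0.05
bound = K / (1.0 - K * mu)          # equals 5.0 for these values
worst = 0.0
for _ in range(1000):
    raw = [rng.random() for _ in range(K)]
    total = sum(raw)
    Z = [v / total for v in raw]    # a random point of the simplex
    worst = max(worst, self_play_value(Z, mu))
```

Each term Z(a)/Z'(a) is strictly below 1/(1 - K mu), so the sampled maximum never reaches the bound; a deterministic Z (all mass on one action) gives the much smaller value 1/(1 - K mu + mu).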



Corollary 2. The set of distributions satisfying the constraints
of Step 1 is non-empty.

Given the existence of $P_t$, we will see below that the
constraints in Step 1 ensure low variance of the policy
value estimator $\eta_t(\pi)$ for all $\pi \in \Pi_{t-1}$. The small variance
is used to ensure accuracy of policy elimination
in Step 5 as quantified in the following lemma:

Lemma 3. With probability at least $1 - \delta$, for all t:

1. $\pi_{\max} \in \Pi_t$ (i.e., $\Pi_t$ is non-empty)

2. $\eta_D(\pi_{\max}) - \eta_D(\pi) \le 4b_t$ for all $\pi \in \Pi_t$

Proof. We will show that for any policy $\pi \in \Pi_{t-1}$, the
probability that $\eta_t(\pi)$ deviates from $\eta_D(\pi)$ by more
than $b_t$ is at most $2\delta_t$. Taking the union bound over all
policies and all time steps we find that with probability
at least $1 - \delta$,

$$|\eta_t(\pi) - \eta_D(\pi)| \le b_t \qquad (3.1)$$

for all t and all $\pi \in \Pi_{t-1}$. Then:

1. By the triangle inequality, in each time step,
$\eta_t(\pi) \le \eta_t(\pi_{\max}) + 2b_t$ for all $\pi \in \Pi_{t-1}$, yielding
the first part of the lemma.

2. Also by the triangle inequality, if $\eta_D(\pi) <
\eta_D(\pi_{\max}) - 4b_t$ for $\pi \in \Pi_{t-1}$, then $\eta_t(\pi) <
\eta_t(\pi_{\max}) - 2b_t$. Hence the policy $\pi$ is eliminated
in Step 5, yielding the second part of the lemma.

It remains to show Eq. (3.1). We fix the policy $\pi \in \Pi$
and time t, and show that the deviation bound is violated
with probability at most $2\delta_t$. Our argument
rests on Freedman's inequality (see Theorem 13 in Appendix A).
Let

$$y_t = \frac{r_t \, \mathbb{1}(\pi(x_t) = a_t)}{W'_t(a_t)} ,$$

i.e., $\eta_t(\pi) = \left( \sum_{t'=1}^{t} y_{t'} \right) / t$. Let $\mathbb{E}_t$ denote the conditional
expectation $\mathbb{E}[\, \cdot \mid h_{t-1}]$. To use Freedman's
inequality, we need to bound the range of $y_t$ and its
conditional second moment $\mathbb{E}_t[y_t^2]$.

Since $r_t \in [0, 1]$ and $W'_t(a_t) \ge \mu_t$, we have the bound

$$0 \le y_t \le 1/\mu_t = R_t .$$


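Next,

$$\mathbb{E}_t[y_t^2] = \mathbb{E}_{(x_t, \vec{r}_t) \sim D} \, \mathbb{E}_{a_t \sim W'_t} [y_t^2]$$
$$= \mathbb{E}_{(x_t, \vec{r}_t) \sim D} \, \mathbb{E}_{a_t \sim W'_t} \left[ \frac{r_t^2 \, \mathbb{1}(\pi(x_t) = a_t)}{W'_t(a_t)^2} \right]$$
$$\le \mathbb{E}_{(x_t, \vec{r}_t) \sim D} \left[ \frac{W'_t(\pi(x_t))}{W'_t(\pi(x_t))^2} \right] \qquad (3.2)$$
$$= \mathbb{E}_{x_t \sim D_X} \left[ \frac{1}{W'_t(\pi(x_t))} \right] \le 2K \qquad (3.3)$$

where Eq. (3.2) follows by boundedness of $r_t$ and
Eq. (3.3) follows from the constraints in Step 1. Hence,

$$\sum_{t'=1 \ldots t} \mathbb{E}_{t'}[y_{t'}^2] \le 2Kt = V_t .$$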



Since $(\ln t)/t$ is decreasing for $t \ge 3$, we obtain that $\mu_t$
is non-increasing (by separately analyzing t = 1, t = 2,
$t \ge 3$). Let $t_0$ be the first t such that $\mu_t < 1/2K$.
Note that $b_t \ge 4K\mu_t$, so for $t < t_0$, we have $b_t \ge 2$ and
$\Pi_t = \Pi$. Hence, the deviation bound holds for $t < t_0$.

Let $t \ge t_0$. For $t' \le t$, by the monotonicity of $\mu_t$,

$$R_{t'} = 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}} .$$

Hence, the assumptions of Theorem 13 are satisfied, and

$$\Pr\left[ |\eta_t(\pi) - \eta_D(\pi)| \ge b_t \right] \le 2\delta_t .$$

The union bound over $\pi$ and t yields Eq. (3.1). □

This immediately implies that the cumulative regret is
bounded by

$$8 \sqrt{2K \ln \frac{4NT^2}{\delta}} \sum_{t=1 \ldots T} \frac{1}{\sqrt{t}} \le 16 \sqrt{2TK \ln \frac{4T^2 N}{\delta}} \qquad (3.4)$$

and gives us the following theorem.

Theorem 4. For all distributions D over $(x, \vec{r})$ with
K actions, for all sets of N policies $\Pi$, with probability
at least $1 - \delta$, the regret of PolicyElimination
(Algorithm 1) over T rounds is at most

$$16 \sqrt{2TK \ln \frac{4T^2 N}{\delta}} .$$

4 THE RANDOMIZED UCB
ALGORITHM

PolicyElimination is the simplest exhibition of the
minimax argument, but it has some drawbacks:

1. The algorithm keeps explicit track of the space
of good policies (like a version space), which is
difficult to implement efficiently in general.

2. If the optimal policy is mistakenly eliminated by
chance, the algorithm can never recover.

3. The algorithm requires perfect knowledge of the
distribution $D_X$ over contexts.

These difficulties are addressed by RandomizedUCB
(or RUCB for short), an algorithm which we present
and analyze in this section. Our approach is reminiscent
of the UCB algorithm (Auer et al., 2002a), developed
for the context-free setting, which keeps an upper
confidence bound on the expected reward for each action.
However, instead of choosing the highest upper
confidence bound, we randomize over choices according
to the value of their empirical performance. The
algorithm has the following properties:



1. The optimization step required by the algorithm
always considers the full set of policies (i.e.,
explicit tracking of the set of good policies is
avoided), and thus it can be efficiently implemented
using an argmax oracle. We discuss this
further in Section 5.

2. Suboptimal policies are implicitly used with decreasing
frequency by using a non-uniform variance
constraint that depends on a policy's estimated
regret. A consequence of this is a bound on
the value of the optimization, stated in Lemma 7
below.



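Algorithm 2 RandomizedUCB($\Pi$, $\delta$, K)

Let $h_0 = \emptyset$ be the initial history.
Define the following quantities:

$$C_t = 2 \log\left( \frac{Nt}{\delta} \right) \quad \text{and} \quad \mu_t = \min\left\{ \frac{1}{2K}, \sqrt{\frac{C_t}{2Kt}} \right\} .$$

For each timestep t = 1 ... T, observe $x_t$ and do:

1. Let $P_t$ be a distribution over $\Pi$ that approximately
solves the optimization problem

   $$\min_P \sum_{\pi \in \Pi} P(\pi) \Delta_{t-1}(\pi)$$

   s.t. for all distributions Q over $\Pi$:

   $$\mathbb{E}_{\pi \sim Q} \left[ \frac{1}{t-1} \sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t) W_P(x_i, \pi(x_i)) + \mu_t} \right] \le \max\left\{ 4K, \frac{(t-1) \Delta_{t-1}(W_Q)^2}{180 C_{t-1}} \right\} \qquad (4.1)$$

   so that the objective value at $P_t$ is within $\epsilon_{\mathrm{opt},t} =
   O(\sqrt{K C_t / t})$ of the optimal value, and so that
   each constraint is satisfied with slack $\le K$.

2. Let $W'_t$ be the distribution over A given by

   $$W'_t(a) = (1 - K\mu_t) W_{P_t}(x_t, a) + \mu_t$$

   for all $a \in A$.

3. Choose $a_t \sim W'_t$.

4. Observe reward $r_t$.

5. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$.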



3. Instead of $D_X$, the algorithm uses the history of
previously seen contexts. The effect of this approximation
is quantified in Theorem 6 below.



4.1 EMPIRICAL VARIANCE ESTIMATES 

A key technical prerequisite for the regret analysis is
the accuracy of the empirical variance estimates. For
a distribution P over policies $\Pi$ and a particular policy
$\pi \in \Pi$, define

$$V_{P,\pi,t} = \mathbb{E}_{x \sim D_X} \left[ \frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t} \right] ,$$
$$\hat{V}_{P,\pi,t} = \frac{1}{t-1} \sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t) W_P(x_i, \pi(x_i)) + \mu_t} .$$

The first quantity $V_{P,\pi,t}$ is (a bound on) the variance
incurred by an importance-weighted estimate of
reward in round t using the action distribution induced
by P, and the second quantity $\hat{V}_{P,\pi,t}$ is an
empirical estimate of $V_{P,\pi,t}$ using the finite sample
$\{x_1, \ldots, x_{t-1}\} \subseteq X$ drawn from $D_X$. We show that
for all distributions P and all $\pi \in \Pi$, $\hat{V}_{P,\pi,t}$ is close to
$V_{P,\pi,t}$ with high probability.

Theorem 6. For any $\epsilon \in (0, 1)$, with probability at
least $1 - \delta$,

$$V_{P,\pi,t} \le (1 + \epsilon) \cdot \hat{V}_{P,\pi,t} + [\ldots] \cdot K$$

for all distributions P over $\Pi$, all $\pi \in \Pi$, and all $t \ge
16K \log(8KN/\delta)$.

The proof appears in Appendix C.
4.2 REGRET ANALYSIS 

The regret of RandomizedUCB is the following:

Theorem 5. For all distributions D over $(x, \vec{r})$ with
K actions, for all sets of N policies $\Pi$, with probability
at least $1 - \delta$, the regret of RandomizedUCB
(Algorithm 2) over T rounds is at most

$$O\left( \sqrt{TK \log(TN/\delta)} + K \log(NK/\delta) \right) .$$

The proof is given in Appendix D.4. Here, we present
an overview of the analysis.

Central to the analysis is the following lemma that
bounds the value of the optimization in each round. It
is a direct corollary of a lemma in Appendix D.4.

Lemma 7. If $\mathrm{OPT}_t$ is the value of the optimization
problem (4.1) in round t, then

$$\mathrm{OPT}_t \le O\left( \sqrt{\frac{K C_{t-1}}{t-1}} \right) = O\left( \sqrt{\frac{K \log(Nt/\delta)}{t}} \right) .$$

This lemma implies that the algorithm is always able
to select a distribution over the policies that focuses
mostly on the policies with low estimated regret.
Moreover, the variance constraints ensure that good
policies never appear too bad, and that only bad policies
are allowed to incur high variance in their reward
estimates. Hence, minimizing the objective in (4.1) is
an effective surrogate for minimizing regret.

The bulk of the analysis consists of analyzing the
variance of the importance-weighted reward estimates
$\eta_t(\pi)$, and showing how they relate to their actual expected
rewards $\eta_D(\pi)$. The details are deferred to Appendix D.



5 USING AN ARGMAX ORACLE 



In this section, we show how to solve the optimization
problem (4.1) using the argmax oracle (AMO) for our
set of policies. Namely, we describe an algorithm running
in polynomial time independent of the number of
policies (or rather, dependent only on log N, the representation
size of a policy), which makes queries to AMO to compute a
distribution over policies suitable for the optimization
step of Algorithm 2.

This algorithm relies on the ellipsoid method. The ellipsoid
method is a general technique for solving convex
programs equipped with a separation oracle. A
separation oracle is defined as follows:

Definition 2. Let S be a convex set in $\mathbb{R}^n$. A separation
oracle for S is an algorithm that, given a point
$x \in \mathbb{R}^n$, either declares correctly that $x \in S$, or produces
a hyperplane H such that x and S are on opposite
sides of H.

We do not describe the ellipsoid algorithm here (since
it is standard), but only spell out its key properties in
the following lemma. For a point $x \in \mathbb{R}^n$ and r > 0,
we use the notation B(x, r) to denote the $\ell_2$ ball of
radius r centered at x.

Lemma 8. Suppose we are required to decide whether
a convex set $S \subseteq \mathbb{R}^n$ is empty or not. We are given
a separation oracle for S and two numbers R and r,
such that $S \subseteq B(0, R)$ and if S is non-empty, then
there is a point $x^*$ such that $S \supseteq B(x^*, r)$. The ellipsoid
algorithm decides correctly if S is empty or not,
by executing at most $O(n^2 \log(R/r))$ iterations, each involving
one call to the separation oracle and additional
$O(n^2)$ processing time.

We now write a convex program whose solution is the
required distribution, and show how to solve it using
the ellipsoid method by giving a separation oracle for
its feasible set using AMO.

Fix a time period t. Let $X_{t-1}$ be the set of all contexts
seen so far, i.e., $X_{t-1} = \{x_1, x_2, \ldots, x_{t-1}\}$. We
embed all policies $\pi \in \Pi$ in $\mathbb{R}^{(t-1)K}$, with coordinates
identified with $(x, a) \in X_{t-1} \times A$. With abuse of notation,
a policy $\pi$ is represented by the vector $\pi$ with
coordinate $\pi(x, a) = 1$ if $\pi(x) = a$ and 0 otherwise.
Let C be the convex hull of all policy vectors $\pi$. Recall
that a distribution P over policies corresponds to
a point inside C, i.e., $W_P(x, a) = \sum_{\pi : \pi(x) = a} P(\pi)$, and
that $W'(x, a) = (1 - \mu_t K) W(x, a) + \mu_t$, where $\mu_t$ is as
defined in Algorithm 2. Also define $\beta_t = \frac{t-1}{180 C_{t-1}}$. In
the following, we use the notation $x \sim h_{t-1}$ to denote
a context drawn uniformly at random from $X_{t-1}$.

Consider the following convex program:

$$\min s \quad \text{s.t.}$$
$$\Delta_{t-1}(W) \le s \qquad (5.1)$$
$$W \in C \qquad (5.2)$$
$$\forall Z \in C : \; \mathbb{E}_{x \sim h_{t-1}} \left[ \sum_a \frac{Z(x, a)}{W'(x, a)} \right] \le \max\{4K, \beta_t \Delta_{t-1}(Z)^2\} \qquad (5.3)$$



We claim that this program is equivalent to the RUCB
optimization problem (4.1), up to finding an explicit
distribution over policies which corresponds to the optimal
solution. This can be seen as follows. Since we
require $W \in C$, it can be interpreted as being equal
to $W_P$ for some distribution over policies P. The constraints
(5.3) are equivalent to (4.1) by the substitution
$Z = W_Q$.

The above convex program can be solved by performing
a binary search over s and testing feasibility of
the constraints. For a fixed value of s, the feasibility
problem defined by (5.1)-(5.3) is denoted by $\mathcal{A}$.

We now give a sketch of how we construct a separation
oracle for the feasible region of $\mathcal{A}$. The details
of the algorithm are a bit complicated due to the fact
that we need to ensure that the feasible region, when
non-empty, has a non-negligible volume (recall the requirements
of Lemma 8). This necessitates having a
small error in satisfying the constraints of the program.
We leave the details to Appendix E. Modulo these details,
the construction of the separation oracle essentially
implies that we can solve $\mathcal{A}$.

Before giving the construction of the separation oracle,
we first show that AMO allows us to do linear
optimization over C efficiently:

Lemma 9. Given a vector $w \in \mathbb{R}^{(t-1)K}$, we can compute
$\arg\max_{Z \in C} w \cdot Z$ using one invocation of AMO.

Proof. The sequence for AMO consists of $x_{t'} \in X_{t-1}$
and $r_{t'}(a) = w(x_{t'}, a)$. The lemma now follows since
$w \cdot \pi = \sum_{x \in X_{t-1}} w(x, \pi(x))$. □
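The reduction in Lemma 9 can be sketched in a few lines: a linear objective over the convex hull C is maximized at a vertex, i.e. at a single policy, so it suffices to hand the oracle the reward sequence built from w. In this illustration the oracle is a brute-force enumeration and the contexts, policies, and weights are hypothetical; coordinates are keyed by (position in the history, action).

```python
def linear_opt_over_hull(w, contexts, policies):
    """Maximize w . Z over the convex hull C of the policy vectors.

    The maximum of a linear function over C is attained at a vertex,
    so one argmax-oracle call on rewards r_i(a) = w[(i, a)] suffices.
    Here the oracle is a stand-in that enumerates the policies.
    """
    def score(pi):
        return sum(w[(i, pi(x))] for i, x in enumerate(contexts))
    return max(policies, key=score)

contexts = [0, 1, 1]                          # contexts seen so far
policies = [lambda x: 0, lambda x: 1, lambda x: x]
w = {(0, 0): 2.0, (0, 1): -1.0,
     (1, 0): 0.5, (1, 1): 0.0,
     (2, 0): -3.0, (2, 1): 1.0}
best = linear_opt_over_hull(w, contexts, policies)
```

The three policies score -0.5, 0.0, and 3.0 here, so the third one is returned. Note that w may have negative entries, which is why Definition 1 allows reward vectors in R^K rather than only in [0, 1]^K.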

We need another simple technical lemma which ex- 
plains how to get a separating hyperplane for viola- 
tions of convex constraints: 

Lemma 10. For $x \in \mathbb{R}^n$, let $f(x)$ be a convex function
of $x$, and consider the convex set $K$ defined by
$K = \{x : f(x) \le 0\}$. Suppose we have a point $y$ such that
$f(y) > 0$. Let $\nabla f(y)$ be a subgradient of $f$ at $y$. Then
the hyperplane $f(y) + \nabla f(y) \cdot (x - y) = 0$ separates $y$
from $K$.

Proof. Let $g(x) = f(y) + \nabla f(y) \cdot (x - y)$. By the
convexity of $f$, we have $f(x) \ge g(x)$ for all $x$. Thus,
for any $x \in K$, we have $g(x) \le f(x) \le 0$. Since
$g(y) = f(y) > 0$, we conclude that $g(x) = 0$ separates
$y$ from $K$. □
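A minimal numerical sketch of Lemma 10 follows. The convex function below (whose sublevel set is the square $[-1,1]^2$) and the test point are chosen purely for illustration and do not come from the paper.

```python
import random

def f(x):
    """Convex function whose sublevel set {f <= 0} is the square [-1, 1]^2."""
    return max(abs(x[0]), abs(x[1])) - 1.0

def subgradient(y):
    """One subgradient of f at y: signed unit vector of a coordinate attaining the max."""
    i = 0 if abs(y[0]) >= abs(y[1]) else 1
    s = [0.0, 0.0]
    s[i] = 1.0 if y[i] >= 0 else -1.0
    return s

def separating_hyperplane(y):
    """Return g(x) = f(y) + s . (x - y), the affine function from Lemma 10."""
    s, fy = subgradient(y), f(y)
    return lambda x: fy + sum(si * (xi - yi) for si, xi, yi in zip(s, x, y))

y = [2.0, 0.5]                      # violating point: f(y) = 1 > 0
g = separating_hyperplane(y)
random.seed(1)
inside = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(1000)]
```

Here `g(y)` is positive while `g` is non-positive on every sampled point of the feasible set, matching the statement of the lemma.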

Now given a candidate point W, a separation oracle 
can be constructed as follows. We check whether W 
satisfies the constraints of A. If any constraint is vi- 
olated, then we find a hyperplane separating W from 
all points satisfying the constraint. 

1. First, for constraint (5.1), note that $\eta_{t-1}(W)$ is
linear in $W$, and so we can compute $\max_\pi \eta_{t-1}(\pi)$
via AMO as in Lemma 9. We can then compute
$\eta_{t-1}(W)$ and check if the constraint is satisfied. If
not, then the constraint, being linear, automatically
yields a separating hyperplane.



2. Next, we consider constraint (5.2). To check if
$W \in C$, we use the perceptron algorithm. We
shift the origin to $W$, and run the perceptron
algorithm with all points $\pi \in \Pi$ being positive
examples. The perceptron algorithm aims to find a
hyperplane putting all policies $\pi \in \Pi$ on one side.
In each iteration of the perceptron algorithm, we
have a candidate hyperplane (specified by its normal
vector), and then if there is a policy $\pi$ that is
on the wrong side of the hyperplane, we can find
it by running a linear optimization over $C$ in the
negative normal vector direction as in Lemma 9.

If $W \notin C$, then in a bounded number of iterations
(depending on the distance of $W$ from $C$, and the
maximum magnitude $\|\pi\|_2$) we obtain a separating
hyperplane. In passing we also note that if
$W \in C$, the same technique allows us to explicitly
compute an approximate convex combination
of policies in $\Pi$ that yields $W$. This is done by
running the perceptron algorithm as before and
stopping after the bound on the number of iterations
has been reached. Then we collect all the
policies we have found in the run of the perceptron
algorithm, and we are guaranteed that $W$ is
close in distance to their convex hull. We can then
find the closest point in the convex hull of these
policies by solving a simple quadratic program.



3. Finally, we consider constraint (5.3). We rewrite
$\eta_{t-1}(W)$ as $\eta_{t-1}(W) = w \cdot W$, where
$w(x_{t'}, a) = r_{t'} I(a = a_{t'}) / W'_{t'}(a_{t'})$. Thus,
$\Delta_{t-1}(Z) = v - w \cdot Z$, where
$v = \max_{\pi'} \eta_{t-1}(\pi') = \max_{\pi'} w \cdot \pi'$, which
can be computed by using AMO once.

Next, using the candidate point $W$, compute the
vector $u$ defined as
\[ u(x,a) = \frac{n_x/(t-1)}{W'(x,a)}, \]
where $n_x$ is the number of times $x$ appears in $h_{t-1}$, so that
\[ \mathbb{E}_{x \sim h_{t-1}}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big] = u \cdot Z. \]
Now, the problem reduces to finding a policy $Z \in C$
which violates the constraint
\[ u \cdot Z \le \max\{4K,\ \beta_t (w \cdot Z - v)^2\}. \]

Define $f(Z) = \max\{4K, \beta_t (w \cdot Z - v)^2\} - u \cdot Z$. Note
that $f$ is a convex function of $Z$. Finding a point
$Z$ that violates the above constraint is equivalent
to solving the following (convex) program:
\[ f(Z) \le 0 \tag{5.4} \]
\[ Z \in C \tag{5.5} \]

To do this, we again apply the ellipsoid method.
For this, we need a separation oracle for the program.
A separation oracle for the constraints (5.5)
can be constructed as in Step 2 above. For the
constraints (5.4), if the candidate solution $Z$ has
$f(Z) > 0$, then we can construct a separating
hyperplane as in Lemma 10.

Suppose that after solving the program, we get
a point $Z \in C$ such that $f(Z) \le 0$, i.e. $W$ violates
the constraint (5.3) for $Z$. Then since constraint
(5.3) is convex in $W$, we can construct a
separating hyperplane as in Lemma 10. This completes
the description of the separation oracle.

Working out the details carefully yields the following
theorem, proved in Appendix E.

Theorem 11. There is an iterative algorithm with
$O(t^5 K^4 \log^2(\frac{tK}{\delta}))$ iterations, each involving one call to
AMO and $O(t^2 K^2)$ processing time, that either
declares correctly that A is infeasible or outputs a
distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies
\[ \forall Z \in C:\ \mathbb{E}_{x \sim h_{t-1}}\Big[\sum_a \frac{Z(x,a)}{W'_P(x,a)}\Big] \le \max\{4K,\ \beta_t \Delta_{t-1}(Z)^2\} + 5\epsilon \]
\[ \Delta_{t-1}(W_P) \le s + 2\gamma, \]
where $\epsilon = \frac{8\delta}{\mu_t^2}$ and $\gamma = \frac{\delta}{\mu_t}$.



6 DELAYED FEEDBACK 

In a delayed feedback setting, we observe rewards with
a $\tau$ step delay according to:

1. The world presents features $x_t$.

2. The learning algorithm chooses an action
$a_t \in \{1, \ldots, K\}$.

3. The world presents a reward $r_{t-\tau}$ for the action
$a_{t-\tau}$ given the features $x_{t-\tau}$.



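Algorithm 3 DelayedPE($\Pi$, $\delta$, $K$, $D_X$, $\tau$)
Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$

Define: $\delta_t = \delta / 4Nt^2$ and $b_t = 2\sqrt{\frac{2K \ln(1/\delta_t)}{t}}$

Define: $\mu_t = \min\Big\{\frac{1}{2K}, \sqrt{\frac{\ln(1/\delta_t)}{2Kt}}\Big\}$

For each timestep $t = 1 \ldots T$, observe $x_t$ and do:

1. Let $t' = \max(t - \tau, 1)$.

2. Choose distribution $P_t$ over $\Pi_{t-1}$ s.t. $\forall \pi \in \Pi_{t-1}$:
\[ \mathbb{E}_{x \sim D_X}\Big[\frac{1}{(1 - K\mu_{t'}) W_{P_t}(x, \pi(x)) + \mu_{t'}}\Big] \le 2K \]

3. $\forall a \in A$, let $W'_t(a) = (1 - K\mu_{t'}) W_{P_t}(x_t, a) + \mu_{t'}$

4. Choose $a_t \sim W'_t$

5. Observe reward $r_t$

6. Let $\Pi_t = \Big\{\pi \in \Pi_{t-1} : \eta_h(\pi) \ge \Big(\max_{\pi' \in \Pi_{t-1}} \eta_h(\pi')\Big) - 2b_{t'}\Big\}$

7. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$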



We deal with delay by suitably modifying Algorithm 1
to incorporate the delay $\tau$, giving Algorithm 3.

Now we can prove the following theorem, which shows 
the delay has an additive effect on regret. 

Theorem 12. For all distributions $D$ over $(x, r)$ with
$K$ actions, for all sets of $N$ policies $\Pi$, and all delay
intervals $\tau$, with probability at least $1 - \delta$, the regret of
DelayedPE (Algorithm 3) is at most
\[ 16\sqrt{2K \ln\frac{4T^2N}{\delta}}\,\big(\tau + \sqrt{T}\big). \]



Proof. Essentially as Theorem 01 The variance bound 
is unchanged because it depends only on the context 
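distribution. Thus, it suffices to replace $\sum_{t=1}^{T} \frac{1}{\sqrt{t}}$ with
\[ \tau + \sum_{t=\tau+1}^{T} \frac{1}{\sqrt{t-\tau}} = \tau + \sum_{t=1}^{T-\tau} \frac{1}{\sqrt{t}} \]
in Eq. (3.3).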



□ 
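The substitution used in this proof can be checked numerically. The sketch below only illustrates the sums involved; the values of `T` and `tau` are arbitrary, and the comparison against $\tau + 2\sqrt{T}$ uses the standard bound $\sum_{t=1}^{n} 1/\sqrt{t} \le 2\sqrt{n}$, which is an added observation rather than a step quoted from the paper.

```python
import math

def sum_no_delay(T):
    """sum_{t=1}^{T} 1/sqrt(t), the quantity appearing without delay."""
    return sum(1.0 / math.sqrt(t) for t in range(1, T + 1))

def sum_with_delay(T, tau):
    """tau + sum_{t=tau+1}^{T} 1/sqrt(t - tau), the delayed replacement."""
    return tau + sum(1.0 / math.sqrt(t - tau) for t in range(tau + 1, T + 1))

T, tau = 10_000, 50           # illustrative horizon and delay
delayed = sum_with_delay(T, tau)
shifted = tau + sum_no_delay(T - tau)
additive_bound = tau + 2.0 * math.sqrt(T)
```

The delay enters only through the additive term `tau`, which is the source of the $(\tau + \sqrt{T})$ factor in Theorem 12.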
A Concentration Inequality



The following is an immediate corollary of Theorem
1 of (Beygelzimer et al., 2011). It can be viewed as
a version of Freedman's Inequality (Freedman, 1975).
Let $y_1, \ldots, y_T$ be a sequence of real-valued random
variables. Let $\mathbb{E}_t$ denote the conditional expectation
$\mathbb{E}[\,\cdot \mid y_1, \ldots, y_{t-1}]$ and $\mathbb{V}_t$ conditional variance.



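Theorem 13 (Freedman-style Inequality). Let $V, R \in \mathbb{R}$
such that $\sum_{t=1}^{T} \mathbb{V}_t[y_t] \le V$, and for all $t$,
$y_t - \mathbb{E}_t[y_t] \le R$. Then for any $\delta > 0$ such that
$R \le \sqrt{V / \ln(2/\delta)}$, with probability at least $1 - \delta$,
\[ \sum_{t=1}^{T} y_t - \sum_{t=1}^{T} \mathbb{E}_t[y_t] \le 2\sqrt{V \ln(2/\delta)}. \]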
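As a sanity check on the shape of this bound, the following Monte Carlo sketch draws independent Rademacher variables, for which the conditional variances sum to $T$ and the deviations are bounded by $1$. The choice of distribution, horizon, and trial count is an assumption made for illustration only.

```python
import math
import random

def freedman_bound(V, delta):
    """Right-hand side of the Freedman-style inequality: 2 * sqrt(V ln(2/delta))."""
    return 2.0 * math.sqrt(V * math.log(2.0 / delta))

def violation_rate(T, delta, trials, seed=0):
    """Fraction of trials in which sum_t (y_t - E_t[y_t]) exceeds the bound."""
    rng = random.Random(seed)
    V, R = float(T), 1.0          # Rademacher y_t: variance 1 each, deviation <= 1
    assert R <= math.sqrt(V / math.log(2.0 / delta))  # precondition of the theorem
    bound = freedman_bound(V, delta)
    bad = 0
    for _ in range(trials):
        total = sum(rng.choice((-1.0, 1.0)) for _ in range(T))
        if total > bound:
            bad += 1
    return bad / trials

rate = violation_rate(T=2000, delta=0.05, trials=300)
```

The observed violation rate should not exceed `delta`; in this setting the bound is loose, so violations are expected to be rare.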



B Minimax Theorem 



The following is a continuous version of Sion's Minimax
Theorem (Sion, 1958, Theorem 3.4).



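Theorem 14. Let $\mathcal{W}$ and $\mathcal{Z}$ be compact and convex
sets, and $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ a function which for all
$Z \in \mathcal{Z}$ is convex and continuous in $W$ and for all
$W \in \mathcal{W}$ is concave and continuous in $Z$. Then
\[ \min_{W \in \mathcal{W}} \max_{Z \in \mathcal{Z}} f(W, Z) = \max_{Z \in \mathcal{Z}} \min_{W \in \mathcal{W}} f(W, Z). \]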



C Empirical Variance Bounds 

In this section we prove Theorem 6. We first show uniform
convergence for a certain class of policy distributions
(Lemma 15), and argue that each distribution $P$
is close to some distribution $\tilde P$ from this class, in the
sense that $V_{P,\pi,t}$ is close to $V_{\tilde P,\pi,t}$ and $\hat V_{P,\pi,t}$ is close
to $\hat V_{\tilde P,\pi,t}$ (Lemma 16). Together, they imply the main
uniform convergence result in Theorem 6.

For each positive integer $m$, let Sparse[$m$] be the set of
distributions $\tilde P$ over $\Pi$ that can be written as
\[ \tilde P(\pi) = \frac{1}{m} \sum_{i=1}^{m} I(\pi = \pi_i) \]
(i.e., the average of $m$ delta functions) for some
$\pi_1, \ldots, \pi_m \in \Pi$. In our analysis, we approximate
an arbitrary distribution $P$ over $\Pi$ by a distribution
$\tilde P \in$ Sparse[$m$] chosen randomly by independently
drawing $\pi_1, \ldots, \pi_m \sim P$; we denote this process by
$\tilde P \sim P^m$.
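Lemma 15. Fix positive integers $(m_1, m_2, \ldots)$. With
probability at least $1 - \delta$ over the random samples
$(x_1, x_2, \ldots)$ from $D_X$,
\[ V_{\tilde P,\pi,t} \le (1+\lambda) \cdot \hat V_{\tilde P,\pi,t} + \Big(5 + \frac{1}{2\lambda}\Big) \cdot \frac{(m_t+1)\log N + \log\frac{2t^2}{\delta}}{\mu_t \cdot (t-1)} \]
for all $\lambda > 0$, all $t > 1$, all $\pi \in \Pi$, and all distributions
$\tilde P \in$ Sparse[$m_t$].

Proof. Let
\[ Z_{\tilde P,\pi,t}(x) := \frac{1}{(1 - K\mu_t) W_{\tilde P}(x, \pi(x)) + \mu_t} \]
so $V_{\tilde P,\pi,t} = \mathbb{E}_{x \sim D_X}[Z_{\tilde P,\pi,t}(x)]$ and
$\hat V_{\tilde P,\pi,t} = (t-1)^{-1} \sum_{i=1}^{t-1} Z_{\tilde P,\pi,t}(x_i)$. Also let
\[ \varepsilon_t := \frac{\log(|\mathrm{Sparse}[m_t]| N 2t^2/\delta)}{\mu_t \cdot (t-1)} \le \frac{(m_t+1)\log N + \log\frac{2t^2}{\delta}}{\mu_t \cdot (t-1)}. \]
We apply Bernstein's inequality and union bounds
over $\tilde P \in$ Sparse[$m_t$], $\pi \in \Pi$, and $t > 1$ so that with
probability at least $1 - \delta$,
\[ V_{\tilde P,\pi,t} \le \hat V_{\tilde P,\pi,t} + \sqrt{2 V_{\tilde P,\pi,t}\, \varepsilon_t} + (2/3)\varepsilon_t \]
for all $t > 1$, all $\pi \in \Pi$, and all distributions
$\tilde P \in$ Sparse[$m_t$]. The conclusion follows by solving the
quadratic inequality for $V_{\tilde P,\pi,t}$ to get
\[ V_{\tilde P,\pi,t} \le \hat V_{\tilde P,\pi,t} + \sqrt{2 \hat V_{\tilde P,\pi,t}\, \varepsilon_t} + 5\varepsilon_t \]
and then applying the AM/GM inequality. □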



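Lemma 16. Fix any $\gamma \in [0, 1]$, and any $x \in X$. For
any distribution $P$ over $\Pi$ and any $\pi \in \Pi$, if
\[ m := \frac{6}{\gamma^2 \mu_t}, \]
then
\[ \mathbb{E}_{\tilde P \sim P^m} \left| \frac{1}{(1 - K\mu_t) W_{\tilde P}(x, \pi(x)) + \mu_t} - \frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t} \right| \le \frac{\gamma}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}. \]
This implies that for all distributions $P$ over $\Pi$ and
any $\pi \in \Pi$, there exists $\tilde P \in$ Sparse[$m$] such that for
any $\lambda > 0$,
\[ (V_{P,\pi,t} - V_{\tilde P,\pi,t}) + (1+\lambda)(\hat V_{\tilde P,\pi,t} - \hat V_{P,\pi,t}) \le \gamma\,(V_{P,\pi,t} + (1+\lambda)\hat V_{P,\pi,t}). \]

Proof. We randomly draw $\tilde P \sim P^m$, with
$\tilde P(\pi') := m^{-1} \sum_{i=1}^{m} I(\pi' = \pi_i)$, and then define
\[ z := \sum_{\pi' \in \Pi} P(\pi') \cdot I(\pi'(x) = \pi(x)) \quad\text{and}\quad \tilde z := \sum_{\pi' \in \Pi} \tilde P(\pi') \cdot I(\pi'(x) = \pi(x)). \]
We have $z = \mathbb{E}_{\pi' \sim P}[I(\pi'(x) = \pi(x))]$ and
$\tilde z = m^{-1} \sum_{i=1}^{m} I(\pi_i(x) = \pi(x))$. In other words, $\tilde z$ is the
average of $m$ independent Bernoulli random variables,
each with mean $z$. Thus, $\mathbb{E}_{\tilde P \sim P^m}[(\tilde z - z)^2] = z(1-z)/m$
and $\Pr_{\tilde P \sim P^m}[\tilde z \le z/2] \le \exp(-mz/8)$ by a Chernoff
bound. We have
\begin{align*}
& \mathbb{E}_{\tilde P \sim P^m} \left| \frac{1}{(1 - K\mu_t)\tilde z + \mu_t} - \frac{1}{(1 - K\mu_t) z + \mu_t} \right| \\
&\le \mathbb{E}_{\tilde P \sim P^m} \frac{(1 - K\mu_t)|\tilde z - z|}{[(1 - K\mu_t)\tilde z + \mu_t][(1 - K\mu_t) z + \mu_t]} \\
&\le \mathbb{E}_{\tilde P \sim P^m} \frac{(1 - K\mu_t)|\tilde z - z|\, I(\tilde z \ge 0.5z)}{0.5[(1 - K\mu_t) z + \mu_t]^2}
 + \mathbb{E}_{\tilde P \sim P^m} \frac{(1 - K\mu_t)|\tilde z - z|\, I(\tilde z < 0.5z)}{\mu_t[(1 - K\mu_t) z + \mu_t]} \\
&\le \frac{(1 - K\mu_t)\sqrt{\mathbb{E}_{\tilde P \sim P^m}|\tilde z - z|^2}}{0.5[(1 - K\mu_t) z + \mu_t]^2}
 + \frac{(1 - K\mu_t)\, z \Pr_{\tilde P \sim P^m}(\tilde z < 0.5z)}{\mu_t[(1 - K\mu_t) z + \mu_t]} \\
&\le \frac{(1 - K\mu_t)\sqrt{z/m}}{0.5[2\sqrt{(1 - K\mu_t) z \mu_t}][(1 - K\mu_t) z + \mu_t]}
 + \frac{(1 - K\mu_t)\, z \exp(-mz/8)}{\mu_t[(1 - K\mu_t) z + \mu_t]} \\
&\le \frac{\gamma\sqrt{1 - K\mu_t}\sqrt{z/m}}{\sqrt{z(6/m)}\,[(1 - K\mu_t) z + \mu_t]}
 + \frac{(1 - K\mu_t)\,\gamma^2\, mz \exp(-mz/8)}{6[(1 - K\mu_t) z + \mu_t]}
\end{align*}
where the third inequality follows from Jensen's inequality,
and the fourth inequality uses the AM/GM
inequality in the denominator of the first term and
the previous observations in the numerators. The final
expression simplifies to the first desired displayed
inequality by observing that $mz \exp(-mz/8) \le 3$ for
all $mz \ge 0$ (the maximum is achieved at $mz = 8$). The
second displayed inequality follows from the following
facts:
\[ \mathbb{E}_{\tilde P \sim P^m} |V_{P,\pi,t} - V_{\tilde P,\pi,t}| \le \gamma V_{P,\pi,t}, \]
\[ \mathbb{E}_{\tilde P \sim P^m} (1+\lambda)|\hat V_{\tilde P,\pi,t} - \hat V_{P,\pi,t}| \le \gamma(1+\lambda)\hat V_{P,\pi,t}. \]
Both inequalities follow from the first displayed bound
of the lemma, by taking expectation with respect to
the true (and empirical) distributions over $x$. The desired
bound follows by adding the above two inequalities,
which implies that the bound holds in expectation,
and hence the existence of $\tilde P$ for which the bound
holds. □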
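The sparsification step in this proof can be illustrated by simulation. In the sketch below, the four-policy distribution `P`, the agreement indicators, and the sample sizes are all made-up values used only to show that the sampled agreement fraction has mean squared error $z(1-z)/m$.

```python
import random

def sparse_agreement(P, agrees, m, rng):
    """Draw pi_1..pi_m i.i.d. from P and return the fraction that agree with pi at x."""
    draws = rng.choices(range(len(P)), weights=P, k=m)
    return sum(agrees[i] for i in draws) / m

P = [0.1, 0.2, 0.3, 0.4]        # illustrative distribution over four policies
agrees = [1, 1, 0, 0]           # indicator I(pi'(x) = pi(x)) for each policy
z = sum(p * a for p, a in zip(P, agrees))   # agreement probability under P

m, trials = 50, 20000
rng = random.Random(0)
mse = sum((sparse_agreement(P, agrees, m, rng) - z) ** 2
          for _ in range(trials)) / trials
exact = z * (1 - z) / m         # variance of the mean of m Bernoulli(z) draws
```

The empirical `mse` should be close to `exact`, which is the variance identity the proof feeds into Jensen's inequality.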



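Now, we can prove Theorem 6.

Proof of Theorem 6. Let
\[ m_t := \frac{6}{\lambda^2} \cdot \frac{1}{\mu_t} \]
(for some $\lambda \in (0, 1/5)$ to be determined) and condition
on the $\ge 1 - \delta$ probability event from Lemma 15 that
\begin{align*}
V_{\tilde P,\pi,t} - (1+\lambda)\hat V_{\tilde P,\pi,t}
&\le K \cdot \Big(5 + \frac{1}{2\lambda}\Big) \cdot \frac{(m_t+1)\log(N) + \log(2t^2/\delta)}{K\mu_t \cdot (t-1)} \\
&\le K \cdot 5\Big(1 + \frac{1}{\lambda}\Big) \cdot \frac{(m_t+1)\log(N) + \log(2t^2/\delta)}{K\mu_t \cdot t}
\end{align*}
for all $t \ge 2$, all $\tilde P \in$ Sparse[$m_t$], and all $\pi \in \Pi$. Using
the definitions of $m_t$ and $\mu_t$, the second term is at most
$(40/\lambda^2) \cdot (1 + 1/\lambda) \cdot K$ for all $t \ge 16K\log(8KN/\delta)$: the
key here is that for $t \ge 16K\log(8KN/\delta)$, we have
$\mu_t = \sqrt{\log(Nt/\delta)/(Kt)} \le 1/(2K)$ and therefore
\[ \frac{m_t \log(N)}{K\mu_t t} \le \frac{6}{\lambda^2} \quad\text{and}\quad \frac{\log(N) + \log(2t^2/\delta)}{K\mu_t t} \le 2. \]

Now fix $t \ge 16K\log(8KN/\delta)$, $\pi \in \Pi$, and a distribution
$P$ over $\Pi$. Let $\tilde P \in$ Sparse[$m_t$] be the distribution
guaranteed by Lemma 16 with $\gamma = \lambda$ satisfying
\[ V_{P,\pi,t} \le \frac{V_{\tilde P,\pi,t} - (1+\lambda)\hat V_{\tilde P,\pi,t} + (1+\lambda)^2 \hat V_{P,\pi,t}}{1 - \lambda}. \]
Substituting the previous bound for $V_{\tilde P,\pi,t} - (1+\lambda)\hat V_{\tilde P,\pi,t}$ gives
\[ V_{P,\pi,t} \le \frac{1}{1-\lambda}\Big(\frac{40}{\lambda^2}(1 + 1/\lambda)K + (1+\lambda)^2 \hat V_{P,\pi,t}\Big). \]
This can be bounded as $(1+\epsilon) \cdot \hat V_{P,\pi,t} + (7500/\epsilon^3) \cdot K$
by setting $\lambda = \epsilon/5$. □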

D Analysis of RandomizedUCB 
D.1 Preliminaries

First, we define the following constants.

• $\epsilon \in (0, 1)$ is a fixed constant, and

• $\rho := \frac{7500}{\epsilon^3}$ is the factor that appears in the bound
from Theorem 6.

• $\theta := (\rho+1)/(1 - (1+\epsilon)/2) = \frac{2}{1-\epsilon}\big(1 + \frac{7500}{\epsilon^3}\big) \ge 5$
is a constant central to Lemma 21, which bounds
the variance of the optimal policy's estimated rewards.

Recall the algorithm-specific quantities
\[ C_t := 2\log\Big(\frac{Nt}{\delta}\Big), \qquad \mu_t := \min\Big\{\frac{1}{2K}, \sqrt{\frac{C_t}{2Kt}}\Big\}. \]
It can be checked that $\mu_t$ is non-increasing. We define
the following time indices:

• $t_0$ is the first round $t$ in which $\mu_t = \sqrt{C_t/(2Kt)}$.
Note that $8K \le t_0 \le 8K\log(NK/\delta)$.

• $t_1 := \lceil 16K\log(8KN/\delta) \rceil$ is the round given by
Theorem 6 such that, with probability at least $1 - \delta$,
\[ \mathbb{E}_{x_t \sim D_X}\Big[\frac{1}{W'_t(\pi(x_t))}\Big] \le (1+\epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{W'_{P_t,\mu_t}(x, \pi(x))}\Big] + \rho K \tag{D.1} \]
for all $\pi \in \Pi$ and all $t \ge t_1$, where $W'_{P,\mu}(x, \cdot)$ is
the distribution over $A$ given by
\[ W'_{P,\mu}(x, a) := (1 - K\mu) W_P(x, a) + \mu, \]
and the notation $\mathbb{E}_{x \sim h_{t-1}}$ denotes expectation
with respect to the empirical (uniform) distribution
over $x_1, \ldots, x_{t-1}$.



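D.2 Deviation Bound for $\eta_t(\pi)$

For any policy $\pi \in \Pi$, define, for $1 \le t \le t_0$,
\[ V_t(\pi) := K, \]
and for $t > t_0$,
\[ V_t(\pi) := \mathbb{E}_{x_t \sim D_X}\Big[\frac{1}{W'_t(\pi(x_t))}\Big]. \]
The $V_t(\pi)$ bounds the variances of the terms in $\eta_t(\pi)$.

Lemma 18. Assume the bound in (D.1) holds for all
$\pi \in \Pi$ and $t \ge t_1$. For all $\pi \in \Pi$:

1. If $t \le t_1$, then
\[ K \le V_t(\pi) \le 4K. \]
2. If $t > t_1$, then
\[ V_t(\pi) \le (1+\epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t}\Big] + (\rho+1)K. \]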



The following lemma shows the effect of allowing slack
in the optimization constraints.

Lemma 17. If $P$ satisfies the constraints of the optimization
problem (4.1) with slack $K$ for each distribution
$Q$ over $\Pi$, i.e.,
\[ \mathbb{E}_{\pi \sim Q}\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}\Big] \le \max\Big\{4K, \frac{(t-1)\Delta_{t-1}(W_Q)^2}{180\,C_{t-1}}\Big\} + K \]
for all $Q$, then $P$ satisfies
\[ \mathbb{E}_{\pi \sim Q}\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}\Big] \le \max\Big\{5K, \frac{(t-1)\Delta_{t-1}(W_Q)^2}{144\,C_{t-1}}\Big\} \]
for all $Q$.

Proof. Let $b := \max\Big\{4K, \frac{(t-1)\Delta_{t-1}(W_Q)^2}{180\,C_{t-1}}\Big\}$. Note that
$\frac{b}{4} \ge K$. Hence $b + K \le \frac{5b}{4}$, which gives the stated
bound. □

Note that the allowance of slack $K$ is somewhat arbitrary;
any $O(K)$ slack is tolerable provided that other
constants are adjusted appropriately.



Proof. For the first claim, note that if t < to, then 
V t (7T) = K, and if t < t < t\, then 



log(Nt /S) 



> 



1 



- V 16i^ 2 log(8KA^/(5) ~ 4K' 
so W/(a) > /J t > l/(4fiT). 

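For the second claim, pick any $t > t_1$, and note that
by definition of $t_1$, for any $\pi \in \Pi$ we have
\[ \mathbb{E}_{x_t \sim D_X}\Big[\frac{1}{W'_t(\pi(x_t))}\Big] \le (1+\epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t}\Big] + \rho K. \]
The stated bound on $V_t(\pi)$ now follows from its definition. □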



Let
\[ V_{\max,t}(\pi) := \max\{V_\tau(\pi),\ \tau = 1, 2, \ldots, t\}. \]

The following lemma gives a deviation bound for $\eta_t(\pi)$
in terms of these quantities.

Lemma 19. Pick any $\delta \in (0, 1)$. With probability at
least $1 - \delta$, for all pairs $\pi, \pi' \in \Pi$ and $t \ge t_0$, we have
\[ \big|(\eta_t(\pi) - \eta_t(\pi')) - (\eta_D(\pi) - \eta_D(\pi'))\big| \le 2\sqrt{\frac{(V_{\max,t}(\pi) + V_{\max,t}(\pi')) \cdot C_t}{t}}. \tag{D.2} \]

Proof. Fix any t > to and 7r,7r' € II. Let S t := 
cxp(— Ct). Pick any t < t. Let 



Z t (tt) 



WW 



so r/t (7r) = t 1 X) T =i ^t(tt)- It is easy to see that 
E [Z t (tt) - Z t (tt')\ = 7] D {w) - vd{7t') 

(x T ,r T )~D, 
a r ~W T 

and 



£ E [(Z r (7T)-Z r (70) 2 ] 

a,.~w; 



< y e 

T— 1 



1 



1 



Wf(7r(a; r )) ^(tt'(x t )) 



Moreover, with probability 1, 



|Z r (7T) - Z t (tt')| < 



/'r 



Now, note that since t > io, Mt — \fjkt> so that 
* = 2^>f ' Furt her, both Knax,t(7r) and Vmax^Tr') are 
at least K, Using these bounds we get 



1 



log(iM) 



>,/i.-^.2tf=l>J- 



C t 2K\i* 



fJ-t 



for all r < t, since the /v's are non-increasing. There- 
fore, by Freedman's inequality (Theorem ll3[) . we have 



Pr 



> 2 



(»7t0) - %(tt')) - (^d(tt) - J?r>(7r')) 



(Kiax,t(7I") + F max ,t(7r')) • log(l/5 t ) 



t 



< 2S t 



The conclusion follows by taking a union bound over 
t <t <T and all pairs ir, w' G II. □ 

D.3 Variance Analysis 

We define the following condition, which will be as- 
sumed by most of the subsequent lemmas in this sec- 
tion. 



Condition 1. The deviation bound (D.1) holds for
all $\pi \in \Pi$ and $t \ge t_1$, and the deviation bound (D.2)
holds for all pairs $\pi, \pi' \in \Pi$ and $t \ge t_0$.



The next two lemmas relate the $V_t(\pi)$ to the $\Delta_t(\pi)$.

Lemma 20. Assume Condition 1. For any $t \ge t_1$ and
$\pi \in \Pi$, if $V_t(\pi) > \theta K$, then
\[ \Delta_{t-1}(\pi) \ge \sqrt{\frac{72\, V_t(\pi)\, C_{t-1}}{t-1}}. \]



Proof. By Lemma [TSJ the fact Vt(Tr) > OK implies 
that 



E 



1 



(1 - KfJk)W Pt (x,Tr(x)) + fM 



Since Vt (?r) > > 5 if, Lemma IT7l implies that in or- 
der for Pt to satisfy the optimization constraint in ()4.f |) 
corresponding to 7r (with slack < X), it must be the 
case that 



> 



■ * • E 

I - 1 a:~ft«_i 



(1 - Kfit)W Pt (x,Tr(x)) +Mt 



Combining with the above, we obtain 



A t _i(7r) > 



'72Vi(7r)C t 



t — 1 



□ 



Lemma 21. Assume Condition 1. For all $t \ge 1$,
$V_{\max,t}(\pi_{\max}) \le \theta K$ and $V_{\max,t}(\pi_t) \le \theta K$.

Proof. By induction on t. The claim for all t < t\ fol- 
lows from Lemma 1181 So take t > t\, and assume as 
the (strong) inductive hypothesis that V maXjT (7T max ) < 
OK and V maX:T (ir T ) < 6K for r G {1, . . . , t — 1}. Sup- 
pose for sake of contradiction that 14(-7r max ) > #iC By 
Lemma [2111 



A t _i(7T max ) > 



/ 72y t (7r max )C t _ 



t- 1 



However, by the deviation bounds, we have 
A t _i(7r max ) + A_D(7r t _i) 



< 24 



(Vmax,t-l(7Tt-l) + Vmax,t-1 ("'iiiax))Ci- 
< - 1 



< ^ / 2Vt(7r max )C t _i /72V r t (7r max )Ct_i 



t- 1 



<- 1 



The second inequality follows from our assumption and 
the induction hypothesis: 

Vt(TTmax) > OK > F m ax,t- 1 {^t-l ) , Knax,t- 1 (^max) ■ 

Since Ai)(7r t _i) > 0, we have a contradiction, so 
it must be that Vt(7r max ) < OK. This proves that 

Vmax,t(7Tmax) < OK. 

It remains to show that V max ,t(7Tt) < OK. So sup- 
pose for sake of contradiction that the inequality fails, 
and let t\ < r < t be any round for which V T (~Kt) — 
V maXlt (n) > OK. By Lemma HU 



On the other hand, 

A T _l(7T t ) < A D (7T T _l) + A T _l(7T t ) + A t (7T max ) 
= ^Ad(7T t _i) + A r _l(7T max )) 

+ (A D (7T f ) + A f (7T max )). 

The parenthesized terms can be bounded using the 
deviation bounds, so we have 

A r _l(7T t ) 



<2\ 



(Vmax,T— i(Tt— l) + Vmax,T— l('7'max))Cv— 1 
T - 1 

_|_ 2 / (ymax,T-l(7Tf) + Knax^-i (7T max ))C r 



(Kiax,t(7Tt) + Vma^^TTmax))^ 



2V T (* t )C T -i . 2V r {-K t )C T . X 



T- 1 



r - 1 



2V T (7Tt)C t 



<1 



f 72T/ T (^)C T , 
r- 1 



where the second inequality follows from the following 
facts: 

1. By induction hypothesis, we have 

^niax,r- 1 (t^t— l): Knax,T-l (^max)i Knax,t(^max) 

OK, and K(7rt) > dif, 

2. V T (iTt) > V max ,t(n), and 

3. since r is a round that achieves V^n aXi t(7r t ), we 
have V T (TT t ) > V T -i(-K t ). 



This contradicts the inequality in (|D.3[) . so it must be 
that F maX;t (7r t ) < OK. □ 

Corollary 22. Under the assumptions of Lemma 21,
\[ \Delta_D(\pi_t) + \Delta_t(\pi_{\max}) \le 2\sqrt{\frac{2\theta K C_t}{t}} \]
for all $t \ge t_0$.

Proof. Immediate from Lemma 21 and the deviation
bounds from (D.2). □



The following lemma shows that if a policy $\pi$ has large
$\Delta_\tau(\pi)$ in some round $\tau$, then $\Delta_t(\pi)$ remains large in
later rounds $t > \tau$.

Lemma 23. Assume Condition 1. Pick any $\pi \in \Pi$
and $t \ge t_1$. If $V_{\max,t}(\pi) > \theta K$, then
\[ \Delta_t(\pi) \ge 2\sqrt{\frac{2 V_{\max,t}(\pi)\, C_t}{t}}. \]



Proof. Let r < t be any round in which V^-tV) 
K iax ,t(7r) > OK. We have 

A t (7T) > A t (7r) - A t ( '"'max 

= A r _i(7r) + (?7t(7r max ) - r] t (ir) - A D (7r)j 

+ (vd(^t-i) - Vd(^) - A T _i(7r)) 



> 



' 72V T (ir)C T - 1 

T - 1 



-21 



(Knax,t(7T) + Kiax^lTrmax))^ 



(Vinax.r-lW + Vmax.T- 1 (flY-l ))CV- 
T - 1 



/72V maXit (7r)CV-i ^ /2Vmax,t(7r)Ci 



T — 1 



-24 



2V r ma x,t(7I')C T -l 



T — 1 



> 2 / 2Vinax,t(7r)C' T _i > /21 ui:,x./ f 7T K '/ 



where the second inequality follows from Lemma [201 
and the deviation bounds, and the third inequality 
follows from Lemma [5T] and the facts that V T (it) = 
Vmax.tM > OK > K naXjt (7r max ),y maX!T _i(7r T _i), and 
Vmax,*(7r) > V r max , r _i(7r). □ 



D.4 Regret Analysis 

We now bound the value of the optimization problem
(4.1), which then leads to our regret bound. The
next lemma shows the existence of a feasible solution
with a certain structure based on the non-uniform
constraints. Recall from Section 5 that solving the
optimization problem A, i.e. constraints (5.1, 5.2, 5.3), for
the smallest feasible value of $s$ is equivalent to solving
the RUCB optimization problem (4.1). Recall that
\[ \beta_t = \frac{t-1}{180\,C_{t-1}}. \]

Lemma 24. There is a point W € R^* -1 ^ such that 



A t _i(W) < 4^ 
W e C 



VZeC: E 

x~h t 



EZ(x, a) 
W 7 



W'{x,a) 

In particular, the value of the optimization prob 
lem OPT t , is bounded by 



f <110 



KCt- 



Proof. Define the sets {Ci : i = 1, 2, . . .} such that 
C l := {Z G C : 2 4+ V < A t _ x (Z) < 2 1+2 k}, 

where k — yfj^- Note that since A t _i(Z) is a linear 
function of Z, each Ci is a closed, convex, compact 
set. Also, define C = {Z G C : A t -i(Z) < 4k}. 
This is also a closed, convex, compact set. Note that 

Let J = {i : Ci ^ 0}.For i 6 J \ {0}, define w 4 = 4"\ 
and let wo = 1 — 2ie/\{o} Wi - Note that u>o > 2/3. 

By Lemma [TJ for each i E I, there is a point Wi G C, 
such that for all Z G Ci, we have 



E 



E^(x, a) 



< 2K. 



Here we use the fact that A/i t < 1/2 to upper 
bound Y^TCt ^ Now consider the point W — 

J2iei w iWi- Since C is convex, W £ C. 

Now fix any i € I. For any (x, a), we have W'(a;, a) > 
WiW-(x, a), so that for all Z G Ci, we have 



E 

x~h t - 



EZ(x, a) 
W'(x, a) 



1 




< — 


2A 






< 4 H 


*A 



<max{4A',/3 t A t „ 1 (Z) 2 }, 
so the constraint for Z is satisfied. 



Finally, since for all i G /, we have u>i < 4 1 and 
At-i(Wi) < 2 i+2 K , we get 

OO 

A t _i(W) = ^ w i A t _ 1 (W i ) < ■ 2 i+2 « < 8k - 



i=0 



□ 



The value of the optimization problem (4.1) can be
related to the expected instantaneous regret of a policy
drawn randomly from the distribution $P_t$.

Lemma 25. Assume Condition 1. Then
\[ \sum_{\pi \in \Pi} P_t(\pi) \Delta_D(\pi) \le \big(220 + 4\sqrt{2\theta}\big)\sqrt{\frac{K C_{t-1}}{t-1}} + 2\varepsilon_{\mathrm{opt},t} \]
for all $t > t_1$.



Proof. Fix any 7r G II and t > t\. By the deviation 
bounds, we have 



< A t _!(7r) + 2 



(Knax,t-lW + V r maX! t_i(7r t _i))C t _ 



t - 1 



< A i _ l ( 7 r) + 2 1 



(ftnax.t-lW +0*0^-1 
t-1 



by Lemma |2"T1 By Corollary [2U we have 



Ar>(7rt-i) <2 



20KC t -i 
t - 1 



Thus, we get 

AdM < (?7_D(7Tt-l) - ?/d(tt)) + A_D(7T t _l) 



< A i _i( 7 r) + 2 1 



(WiW+^)C ( -i 



t- 1 



2<9AC t _i 



f - 1 

If Knax,t-i(7r) < 0A, then we have 



Ad(tt) < At-i(Tr) +4 



29KC t -i 
t-1 



Otherwise, Lemma [2U1 implies that 

(t-1). A^tt) 2 



so 



A d (tt) < A t _i(7r) + 2 



8C t _i 



At_x(7r)2 BKC^ 
8 t-1 



26KC t -i 
t- 1 



< 2A i _ 1 ( 7 r) +4 



26»AC t _i 
t-1 



Therefore 



?ren 



<2^P t ( 7 r)A t _ 1 (7r) + 4 



Tren 



29KC t _x 
t-1 



< 2 (OPT t +e op t,t) + 4 



t - 1 



where OPT t is the value of the optimization prob- 
lem (14.111. The conclusion follows from Lemma (241 □ 



We can now finally prove the main regret bound for 
RUCB. 

Proof of Theorem [3J The regret through the first t\ 
rounds is trivially bounded by t\. In the event that 
Condition Q] holds, we have for all t>t±, 



J2w t (a)r t (a) > £(1 - K^ t )W Pt (x t , a)r t (a) 

a£A 

> W Pt (x t ,a)r t (a) - Kfi t 



wen 



and therefore 
E 

Ot,F(t))~D 
<H~W! 



i r t(a t )} 



E 

(x t ,r(t))~D 



^2WUa)r t (a) 



_a£A 



?ren 



> VD^max) - O \ \j _ + £ pt,t 



where the last inequality follows from Lemma 
Summing the bound from t = t\ + 1, . . . , T gives 



^ E [VD^max) - r t (a t )] 

at-Wi 



<h + ^TK log {NT 1 5) 



By Azuma's inequality, the probability that
$\sum_{t=1}^{T} r_t(a_t)$ deviates from its mean by more than
$O(\sqrt{T \log(1/\delta)})$ is at most $\delta$. Finally, the probability
that Condition 1 does not hold is at most $2\delta$ by
Lemma 19, Theorem 6, and a union bound. The
conclusion follows by a final union bound. □



E Details of Oracle-based Algorithm 

We show how to (approximately) solve A using the
ellipsoid algorithm with AMO. Fix a time period $t$.
To avoid clutter, (only) in this section we drop the
subscript $t-1$ from $\eta_{t-1}(\cdot)$, $\Delta_{t-1}(\cdot)$, and $h_{t-1}$ so that
they become $\eta(\cdot)$, $\Delta(\cdot)$, and $h$ respectively.

In order to use the ellipsoid algorithm, we need to
relax the program a little bit in order to ensure that
the feasible region has a non-negligible volume. To do
this, we need to obtain some perturbation bounds for
the constraints of A. The following lemma gives such
bounds. For any $\delta \ge 0$, we define $C_\delta$ to be the set of
all points within a distance of $\delta$ from $C$.

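Lemma 26. Let $\delta \le \mu_t/4$ be a parameter. Let
$U, W \in C_{2\delta}$ be points such that $\|U - W\| \le \delta$. Then we have
\[ |\Delta(U) - \Delta(W)| \le \gamma \tag{E.1} \]
\[ \forall Z \in C_1:\ \Big|\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{U'(x,a)}\Big] - \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]\Big| \le \epsilon \tag{E.2} \]
where $\epsilon = \frac{8\delta}{\mu_t^2}$ and $\gamma = \frac{\delta}{\mu_t}$.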



Proof. First, we have
\[ |\eta(U) - \eta(W)| \le \frac{1}{t-1} \sum_{(x,a,r,q) \in h} \frac{r}{q}\,|U(x,a) - W(x,a)| \le \frac{\delta}{\mu_t} = \gamma, \]
which implies (E.1).

Next, for any $Z \in C_1$, we have
\[ \Big|\sum_a \frac{Z(x,a)}{U'(x,a)} - \sum_a \frac{Z(x,a)}{W'(x,a)}\Big| \le \sum_a \frac{|Z(x,a)|\,|U'(x,a) - W'(x,a)|}{U'(x,a)\,W'(x,a)} \le \frac{8\delta}{\mu_t^2}. \]
In the last inequality, we use the Cauchy-Schwarz inequality,
and use the following facts (here, $Z(x, \cdot)$ denotes
the vector $(Z(x,a))_a$, etc.):

1. $\|Z(x, \cdot)\| \le 2$ since $Z \in C_1$,

2. $\|U'(x, \cdot) - W'(x, \cdot)\| \le \|U(x, \cdot) - W(x, \cdot)\| \le \delta$,
and

3. $U'(x,a) \ge (1 - \mu_t K) \cdot (-2\delta) + \mu_t \ge \mu_t/2$, for $\delta \le \mu_t/4$,
and similarly $W'(x,a) \ge \mu_t/2$.

This implies (E.2). □



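We now consider the following relaxed form of A.
Here, $\delta \in (0, \mu_t/4)$ is a parameter. We want to find
a point $W \in \mathbb{R}^{(t-1)K}$ such that
\[ \Delta(W) \le s + \gamma \tag{E.3} \]
\[ W \in C_\delta \tag{E.4} \]
\[ \forall Z \in C_{2\delta}:\ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + \epsilon, \tag{E.5} \]
where $\epsilon$ and $\gamma$ are as defined in Lemma 26. Call this
relaxed program A'.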



We apply the ellipsoid method to A' rather than A. 
Recall the requirements of Lemma [SJ we need an en- 
closing ball of bounded radius for the feasible region, 
and the radius of an enclosed ball in the feasible region. 
The following lemma gives this. 

Lemma 27. The feasible region for A' is contained in
$B(0, \sqrt{t} + \delta)$, and if A is feasible, then it contains a
ball of radius $\delta$.

Proof. Note that for any $W \in C_\delta$, we have
$\|W\| \le \sqrt{t} + \delta$, so the feasible region lies in $B(0, \sqrt{t} + \delta)$.

Next, if A is feasible, let $W^* \in C$ be any feasible solution
to A. Consider the ball $B(W^*, \delta)$. Let $U$ be any
point in $B(W^*, \delta)$. Clearly $U \in C_\delta$. By Lemma 26,
assuming $\delta \le 1/2$, we have for all $Z \in C_{2\delta}$,
\[ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{U'(x,a)}\Big] \le \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W^{*\prime}(x,a)}\Big] + \epsilon \le \max\{4K,\ \beta_t \Delta(Z)^2\} + \epsilon. \]
Also
\[ \Delta(U) \le \Delta(W^*) + \gamma \le s + \gamma. \]
Thus, $U$ is feasible for A', and hence the entire ball
$B(W^*, \delta)$ is feasible for A'. □

We now give the construction of a separation oracle for
the feasible region of A' by checking for violations of
the constraints. In the following, we use the word "iteration"
to indicate one step of either the ellipsoid algorithm
or the perceptron algorithm. Each such iteration
involves one call to AMO, and additional $O(t^2 K^2)$
processing time.

Let $W \in \mathbb{R}^{(t-1)K}$ be a candidate point that we want to
check for feasibility for A'. We can check for violation
of the constraint (E.3) easily, and since it is a linear
constraint in $W$, it automatically yields a separating
hyperplane if it is violated.

The harder constraints are (E.4) and (E.5). Recall
that Lemma 9 shows that AMO allows us to do
linear optimization over $C$ efficiently. This immediately
gives us the following useful corollary:

Corollary 28. Given a vector $w \in \mathbb{R}^{(t-1)K}$ and $\delta > 0$,
we can compute $\arg\max_{Z \in C_\delta} w \cdot Z$ using one invocation
of AMO.



Proof. This follows directly from the following fact:
\[ \arg\max_{Z \in C_\delta} w \cdot Z = \frac{\delta}{\|w\|}\, w + \arg\max_{Z \in C} w \cdot Z. \]
□
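The identity behind Corollary 28 can be checked on a toy instance. In this sketch the hull $C$ is a small simplex and the maximization over its vertices stands in for the AMO call; the vertices, the vector `w`, and `delta` are arbitrary illustrative values.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def argmax_over_hull(vertices, w):
    """Linear optimization over C = conv(vertices); stands in for one AMO call."""
    return max(vertices, key=lambda v: dot(w, v))

def argmax_over_fattened_hull(vertices, w, delta):
    """Maximizer over C_delta: shift the maximizer over C by delta * w / ||w||."""
    norm = math.sqrt(dot(w, w))
    best = argmax_over_hull(vertices, w)
    return [b + delta * wi / norm for b, wi in zip(best, w)]

def random_point_in_fattened_hull(vertices, delta, rng):
    """Random convex combination of the vertices plus an offset of norm <= delta."""
    weights = [rng.random() for _ in vertices]
    total = sum(weights)
    point = [sum(wt * v[i] for wt, v in zip(weights, vertices)) / total
             for i in range(len(vertices[0]))]
    direction = [rng.gauss(0.0, 1.0) for _ in point]
    scale = delta * rng.random() / math.sqrt(dot(direction, direction))
    return [p + scale * d for p, d in zip(point, direction)]

vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w, delta = [0.5, -1.0, 2.0], 0.1
z_star = argmax_over_fattened_hull(vertices, w, delta)
rng = random.Random(0)
samples = [random_point_in_fattened_hull(vertices, delta, rng) for _ in range(2000)]
```

None of the sampled points of $C_\delta$ attains a larger value of $w \cdot Z$ than `z_star`, consistent with the stated fact.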



Now we show how to use AMO to check for constraint
(E.4):

Lemma 29. Suppose we are given a point $W$. Then
in $O(\frac{t}{\delta^2})$ iterations, if $W \notin C_{2\delta}$, we can construct a
hyperplane separating $W$ from $C_\delta$. Otherwise, we declare
correctly that $W \in C_{2\delta}$. In the latter case, we can
find an explicit distribution $P$ over policies in $\Pi$ such
that $W_P$ satisfies $\|W_P - W\| \le 2\delta$.



Proof. We run the perceptron algorithm with the origin
at $W$ and all points in $C_\delta$ being positive examples.
The goal of the perceptron algorithm then is to
find a hyperplane going through $W$ that puts all of $C_\delta$
(strictly) on one side. In each iteration of the perceptron
algorithm, we have a weight vector $w$ that is the
normal to a candidate hyperplane, and we need to find
a point $Z \in C_\delta$ such that $w \cdot (Z - W) \le 0$ (note that
we have shifted the origin to $W$). To do this, we use
AMO as in Lemma 9 to find $Z^* = \arg\max_{Z \in C_\delta} -w \cdot Z$.
If $w \cdot (Z^* - W) \le 0$, we use $Z^*$ to update $w$ using the
perceptron update rule, $w \leftarrow w + (Z^* - W)$. Otherwise,
we have $w \cdot (Z - W) > 0$ for all $Z \in C_\delta$, and
hence we have found our separating hyperplane.

Now suppose that $W \notin C_{2\delta}$, i.e. the distance of $W$
from $C_\delta$ is more than $\delta$. Since
$\|Z - W\| \le 2\sqrt{t} + 3\delta = O(\sqrt{t})$ for all $Z \in C_\delta$
(assuming $\delta = O(\sqrt{t})$), the perceptron convergence
guarantee implies that in $O(\frac{t}{\delta^2})$ iterations we find
a separating hyperplane.

If in $k = O(\frac{t}{\delta^2})$ iterations we haven't found a separating
hyperplane, then $W \in C_{2\delta}$. In fact the perceptron
algorithm gives a stronger guarantee: if the $k$ policies
found in the run of the perceptron algorithm are
$\pi_1, \pi_2, \ldots, \pi_k \in \Pi$, then $W$ is within a distance of $2\delta$
from their convex hull, $C' = \mathrm{conv}(\pi_1, \pi_2, \ldots, \pi_k)$. This
is because a run of the perceptron algorithm on $C'_{2\delta}$
would be identical to that on $C_{2\delta}$ for $k$ steps. We can
then compute the explicit distribution over policies $P$
by computing the Euclidean projection of $W$ on $C'$ in
poly($k$) time using a convex quadratic program:
\[ \min\ \Big\|W - \sum_i P_i \pi_i\Big\|^2 \quad\text{s.t.}\quad \sum_i P_i = 1,\ \ \forall i : P_i \ge 0. \]
Solving this quadratic program, we get a distribution
$P$ over the policies $\{\pi_1, \pi_2, \ldots, \pi_k\}$ such that
$\|W_P - W\| \le 2\delta$. □
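The perceptron loop in this proof can be sketched on a toy instance. This is a simplified illustration under stated assumptions: it works with the plain hull $C$ (no $\delta$-fattening), the minimum over vertices stands in for the AMO call, and the final quadratic-program projection is omitted. The function name `perceptron_separation` and the example points are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def perceptron_separation(vertices, W, max_iters):
    """Look for a normal w with w . (Z - W) > 0 for every vertex Z of the hull.

    Returns (w, found) on success, or (None, found) if max_iters is exhausted.
    `found` lists the vertices used in perceptron updates."""
    w = [0.0] * len(W)
    found = []
    for _ in range(max_iters):
        # One linear optimization in direction -w (stand-in for the oracle call).
        z = min(vertices, key=lambda v: dot(w, sub(v, W)))
        if dot(w, sub(z, W)) > 0:
            return w, found          # every vertex is strictly on the positive side
        found.append(z)
        w = [wi + di for wi, di in zip(w, sub(z, W))]   # perceptron update
    return None, found

vertices = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outside, inside = [2.0, 2.0], [0.25, 0.25]
normal_out, _ = perceptron_separation(vertices, outside, max_iters=100)
normal_in, visited = perceptron_separation(vertices, inside, max_iters=100)
```

For the outside point a separating normal is returned after a single update; for the inside point no separator exists, so the loop runs out of iterations and `visited` holds the vertices that a projection step could then combine.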



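Finally, we show how to check constraint (E.5):

Lemma 30. Suppose we are given a point $W$. In
$O(\frac{t^3 K^2}{\delta^2} \cdot \log(\frac{tK}{\delta}))$ iterations, we can either find a point
$Z \in C_{2\delta}$ such that
\[ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big] \ge \max\{4K,\ \beta_t \Delta(Z)^2\} + 2\epsilon, \]
or else we conclude correctly that for all $Z \in C$, we
have
\[ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + 3\epsilon. \]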



Proof. We first rewrite rj{W) as T)(W) — w ■ tt, where 
to is a vector defined as 



w(x, a) = 



1 



t- 1 



E 



(x / ,a / ,r,p)E/t: x'—x,a'—a 



Thus, A(Z) = v — w ■ Z , where v = max T / ^K 71 "') 
m ;i x ... w ■ 7r' which can be computed by using AM.O 
once. 

Next, using the candidate point W, compute the 



vector u defined as u(x, a) 



where 



W'{x,a) 

is the number of times x appears in h, so that 



Ex~h 



EZ(x,a) 
a W'(x,a) 



= u- Z . Now, the problem reduces 



to finding a point R £ C which violates the constraint 
u ■ Z < max{4fT, f3 t (w ■ Z - v) 2 } + 3e. 

Define 

f(Z) = max{4X, f3 t (w ■ Z - v) 2 } + 3e-u-Z. 

Note that / is convex function of Z. Checking for vi- 
olation of the above constraint is equivalent to solving 
the following (convex) program: 



f(Z) < 

z e C 



(E.6) 
(E.7) 



To do this, we again apply the ellipsoid method, but 
on the relaxed program 



f(Z) < e 

z e c s 



(E.8) 
(E.9) 



To run the ellipsoid algorithm, we need a separation
oracle for the program. Given a candidate solution $Z$,
we run the algorithm of Lemma 29, and if $Z \notin C_{2\delta}$, we
construct a hyperplane separating $Z$ from $C_{\delta}$.

Now suppose we conclude that $Z \in C_{2\delta}$. Then we
construct a separation oracle for (E.6) as follows. If
$f(Z) > \epsilon$, then since $f$ is a convex function of $Z$, we
can construct a separating hyperplane as in Lemma lTOl 
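
The cut obtained from convexity can be made concrete with a subgradient. The sketch below is an illustration under assumptions rather than the construction of the cited lemma: vectors are plain Python lists, and `f_value`, `f_subgradient` and `separating_cut` are hypothetical names. For a subgradient `g` of `f` at `Z`, convexity gives `f(Y) >= f(Z) + g . (Y - Z)`, so every `Y` with `f(Y) <= eps` lies in the half-space `g . Y <= g . Z + eps - f(Z)`, which excludes `Z` whenever `f(Z) > eps`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def f_value(Z, u, w, v, K, beta, eps):
    """f(Z) = max{4K, beta * (w.Z - v)^2} + 3*eps - u.Z"""
    gap = dot(w, Z) - v
    return max(4.0 * K, beta * gap * gap) + 3.0 * eps - dot(u, Z)


def f_subgradient(Z, u, w, v, K, beta):
    """One subgradient of f at Z: the gradient of whichever piece is active."""
    gap = dot(w, Z) - v
    if beta * gap * gap > 4.0 * K:
        # The quadratic piece is active.
        return [2.0 * beta * gap * wi - ui for wi, ui in zip(w, u)]
    # The constant piece 4K is active, leaving only the linear term.
    return [-ui for ui in u]


def separating_cut(Z, u, w, v, K, beta, eps):
    """Return (g, offset) with {Y : f(Y) <= eps} inside {Y : g.Y <= offset}.

    Returns None when f(Z) <= eps, since Z then needs no cut.
    """
    fz = f_value(Z, u, w, v, K, beta, eps)
    if fz <= eps:
        return None
    g = f_subgradient(Z, u, w, v, K, beta)
    return g, dot(g, Z) + eps - fz


# Toy values (assumed) purely to exercise the functions.
u, w, v, K, beta, eps = [1.0, 0.5], [0.2, 0.8], 1.0, 2, 100.0, 0.1
Z = [0.0, 0.0]
g, offset = separating_cut(Z, u, w, v, K, beta, eps)
```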

Now we can run the ellipsoid algorithm with the
starting ellipsoid being $B(0, \sqrt{t})$. If there is a point
$Z^* \in C$ such that $f(Z^*) \le 0$, then consider the ball



B(Z\ 



48 



For any Y G B(Z*, 



45 



we have 



\[
|(u \cdot Z^*) - (u \cdot Y)| \le \|u\| \, \|Z^* - Y\| \le \frac{\epsilon}{2}
\]



since Hull < Also, 



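\[
\begin{aligned}
\beta_t \, |(w \cdot Z^* - v)^2 - (w \cdot Y - v)^2|
&= \beta_t \, |(w \cdot Z^* - w \cdot Y)(w \cdot Z^* + w \cdot Y - 2v)| \\
&\le \beta_t \, \|w\| \, \|Z^* - Y\| \, \big(\|w\| (\|Z^*\| + \|Y\|) + 2|v|\big) \le \frac{\epsilon}{2},
\end{aligned}
\]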

since ||u>|| < i, ||Z*|| < yft, \\Y\\ < y/i+S < 2y/i, and 
\v\ < \\w\\ ■ jl< 

Thus, $f(Y) \le f(Z^*) + \epsilon \le \epsilon$, so the entire ball



B(Z*, 



45 



-) is feasible for the relaxed program. 



By Lemma [SJ in 0(t 2 K 2 ■ log(^)) iterations of the 



5VtK/3 t ■ 

mma [5J 

ellipsoid algorithm, we obtain one of the following: 
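1. we either find a point $Z \in C_{2\delta}$ such that $f(Z) \le \epsilon$,
i.e.

\[
\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
\ge \max\{4K, \beta_t \Delta(Z)^2\} + 2\epsilon,
\]

2. or else we conclude that the original convex program
(E.6)-(E.7) is infeasible, i.e. for all $Z \in C$, we
have

\[
\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
\le \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon.
\]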



The total number of iterations is bounded by
$O\big(t^2 K^2 \cdot \log(\frac{tK}{\delta})\big) \cdot O\big(\frac{t}{\delta^2}\big) =
O\big(\frac{t^3 K^2}{\delta^2} \cdot \log(\frac{tK}{\delta})\big)$. □
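
The ellipsoid iteration used throughout this proof can be sketched generically. This is a minimal central-cut version under assumptions, not the exact variant analysed in the paper: the dimension is at least 2, the search starts from the ball $B(0, R)$, and the hypothetical `oracle` returns `None` at a feasible point or otherwise a vector `g` such that every feasible `y` satisfies `g . (y - center) <= 0`.

```python
import math


def ellipsoid_feasibility(oracle, n, R, max_iter=1000):
    """Search for a feasible point with the central-cut ellipsoid method.

    The current ellipsoid is {x : (x - center)^T A^{-1} (x - center) <= 1}.
    """
    center = [0.0] * n
    A = [[R * R if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        g = oracle(center)
        if g is None:
            return center  # the oracle accepted the current center
        Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
        gAg = sum(g[i] * Ag[i] for i in range(n))
        if gAg <= 0.0:
            return None  # degenerate cut; give up
        b = [x / math.sqrt(gAg) for x in Ag]
        # Smallest ellipsoid containing the half of the old one kept by the cut.
        center = [center[i] - b[i] / (n + 1) for i in range(n)]
        scale = n * n / (n * n - 1.0)
        A = [[scale * (A[i][j] - 2.0 / (n + 1) * b[i] * b[j])
              for j in range(n)] for i in range(n)]
    return None


def ball_oracle(x, target=(0.5, -0.3), radius=0.1):
    """Toy separation oracle (assumed): the feasible set is a small ball."""
    diff = [xi - ti for xi, ti in zip(x, target)]
    if math.sqrt(sum(d * d for d in diff)) <= radius:
        return None
    return diff  # feasible points y satisfy diff . (y - x) < 0


point = ellipsoid_feasibility(ball_oracle, n=2, R=2.0)
```

The volume of the ellipsoid shrinks by a constant factor per cut, which is where iteration counts of the form $n^2 \log(R/r)$ come from when the feasible set contains a ball of radius $r$; here $n$ plays the role of the dimension $tK$.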



Lemma 31. Suppose we are given a point $Z \in C_{2\delta}$
such that

\[
\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
\ge \max\{4K, \beta_t \Delta(Z)^2\} + 2\epsilon.
\]

Then we can construct a hyperplane separating $W$
from all feasible points for $A'$.

Proof. For notational convenience, define the function

\[
f_Z(W) := \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
- \max\{4K, \beta_t \Delta(Z)^2\} - 2\epsilon.
\]



Note that it is a convex function of W. Note that for 
any point U that is feasible for A', we have fz(U) < 
— e, whereas fz(W) > 0. Thus, by Lemma [TU1 we can 
construct the desired separating hyperplane. □ 

We can finally prove Theorem [TO 

Proof. [Theorem [TTJ] We run the ellipsoid algorithm 
starting with the ball $B(0, \sqrt{t} + \delta)$. At each point,
we are given a candidate solution $W$ for program $A'$.
We check for violation of constraint (E.3) first. If
it is violated, the constraint, being linear, gives us
a separating hyperplane. Else, we use Lemma 29 to
check for violation of constraint (E.4). If $W \notin C_{2\delta}$,
then we can construct a separating hyperplane. Else,
we use Lemmas 30 and 31 to check for violation of
constraint (E.5). If there is a $Z \in C$ such that

\[
\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
\ge \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon,
\]

then we can find a separating hyperplane. Else, we
conclude that the current point $W$ satisfies the following
constraints:



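\[
\begin{aligned}
& \Delta(W) \le s + \gamma \\
& W \in C_{2\delta} \\
& \forall Z \in C:\ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x,a)}{W'(x,a)}\Big]
\le \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon
\end{aligned}
\]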



We can then use the perceptron-based algorithm of
Lemma 29 to "round" $W$ to an explicit distribution $P$
over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le
2\delta$. Then Lemma 26 implies the stated bounds for $W_P$.

By Lemma [SI in 0(t 2 K 2 log(4)) iterations of the el- 
lipsoid algorithm, we find the point W satisfying the 
constraints given above, or declare correctly that A is 
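infeasible. In the worst case, we might have to run the
algorithm of Lemma 30 in every iteration, leading to an
upper bound of $O\big(t^2 K^2 \log(\frac{tK}{\delta})\big) \times O\big(\frac{t^3 K^2}{\delta^2} \cdot \log(\frac{tK}{\delta})\big) =
O\big(\frac{t^5 K^4}{\delta^2} \log^2(\frac{tK}{\delta})\big)$ on the number of iterations. □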



